Re: [PATCH] reiser4: fix readv

2006-09-17 Thread Andrew Morton

btw, please be aware that the post-2.6.17 deadlock fixes which went into
generic_file_buffered_write() caused a few problems:

a) Significant NFS overwrite performance regression (due to running
   prepare_write/commit_write) against each writev segment.

b) Smaller but significant performance regression for all writev()
   usages (I expect - haven't measured this yet).

c) It's still theoretically deadlockable.

I am (slowly) working on addressing these issues - more info later.

Basically, we will again need to be able to traverse multiple iovec
segments within a single prepare_write/commit_write instance.  This will
impact your batch_write patch, because that patch somewhat codifies in the
function layout the concept that we do one-segment-pre-prepare_write().

So that patch will need significant rework, I suspect.

What I'm hoping to be able to do is to simply revert the two deadlock-fix
patches (so we go back to the 2.6.17 code) and to remove
fault_in_pages_readable() and to simply unlock the page after
->prepare_write() and lock it again prior to ->commit_write().  But I
haven't tried that yet - there's quite a bit of preparatory work needed.


Re: [PATCH] reiser4: fix readv

2006-09-17 Thread Andrew Morton
On Sun, 17 Sep 2006 22:39:07 +0400
"Vladimir V. Saveliev" <[EMAIL PROTECTED]> wrote:

> Hello
> 
> On Friday 15 September 2006 06:24, Andrew Morton wrote:
> > On Wed, 13 Sep 2006 22:12:54 +0400
> > "Vladimir V. Saveliev" <[EMAIL PROTECTED]> wrote:
> > 
> > > Hello, Andrew
> > > 
> > > reiser4 in 2.6.18-rc6-mm2 has a bug. It can not do readv.
> > > 
> > > The attached patch fixes it by implementing reiser4' aio_read file 
> > > operation.
> > > Unfortunately, it appeared to get a loop which is very similar to the one 
> > > of
> > > fs/read_write.c:do_loop_readv_writev().
> > > Alternatively, if do_loop_readv_writev were EXPORT_SYMBOL-ed
> > > reiser4' aio_read could use it instead. But, there is a problem with 
> > > do_loop_readv_writev EXPORT_SYMBOL-ing:
> > > one if its arguments is io_fn_t, which is declared in fs/read_write.h.
> > > If it is ok to move io_fn_t and do_loop_readv_writev declarations to 
> > > include/linux/fs.h and to EXPORT_SYMBOL 
> > > do_loop_readv_writev the fix will be smaller. Please, let me know what 
> > > would you prefer.
> > > 
> > 
> > Yes, I'd say that do_loop_readv_writev() is suitable for exporting to
> > filesystems, and that doing so is preferable to duplicating it.
> > 
> > That'd be two patches, please: one to do the export and one to use it in
> > reiser4.
> > 
> > I assume there's a good reason why reiser4 cannot use
> > generic_file_aio_read() or vfs_readv().  Please capture that discussion in
> > the changelog for the first patch, thanks.
> > 
> 
> It seems the problem can be fixed a bit simpler. Currently, there is a 
> difference between read and readv: read calls f_op->read if it is defined,
> but readv calls f_op->read if f_op->aio_read is not defined. The latest is a 
> bit unlogical imho: 
> wouldn't it be more consistent if readv worked via f_op->read if it is 
> defined?
> If we fixed readv (do_readv_writev) that way - reiser4 would not need 
> anything from its aio_read but generic_file_aio_read.
> 
> 
> From: Vladimir Saveliev <[EMAIL PROTECTED]>
> 
> There is some asymmetry between read and readv:
> read (vfs_read, namely) calls f_op->read when it is defined,
> while readv (do_readv_writev) calls f_op->read when f_op->aio_read is not 
> defined.
> This patch makes do_readv_writev to call do_loop_readv_writev
> (which calls f_op->read for each segment of i/o vector) if f_op->read is 
> defined.
> 
> Signed-off-by: Vladimir Saveliev <[EMAIL PROTECTED]>
> 
> 
> 
> 
> diff -puN fs/read_write.c~fix-do_readv_writev fs/read_write.c
> --- linux-2.6.18-rc6-mm2/fs/read_write.c~fix-do_readv_writev  2006-09-15 
> 17:46:03.0 +0400
> +++ linux-2.6.18-rc6-mm2-vs/fs/read_write.c   2006-09-17 23:01:17.0 
> +0400
> @@ -608,7 +608,6 @@ static ssize_t do_readv_writev(int type,
>   if (ret)
>   goto out;
>  
> - fnv = NULL;
>   if (type == READ) {
>   fn = file->f_op->read;
>   fnv = file->f_op->aio_read;
> @@ -617,11 +616,11 @@ static ssize_t do_readv_writev(int type,
>   fnv = file->f_op->aio_write;
>   }
>  
> - if (fnv)
> + if (fn)
> + ret = do_loop_readv_writev(file, iov, nr_segs, pos, fn);
> + else
>   ret = do_sync_readv_writev(file, iov, nr_segs, tot_len,
>   pos, fnv);
> - else
> - ret = do_loop_readv_writev(file, iov, nr_segs, pos, fn);
>  
>  out:
>   if (iov != iovstack)
> 
> _

OK, thanks.  I'll wait to hear from Badari and Christoph on that.




Re: [PATCH] reiser4: fix readv

2006-09-14 Thread Andrew Morton
On Wed, 13 Sep 2006 22:12:54 +0400
"Vladimir V. Saveliev" <[EMAIL PROTECTED]> wrote:

> Hello, Andrew
> 
> reiser4 in 2.6.18-rc6-mm2 has a bug. It can not do readv.
> 
> The attached patch fixes it by implementing reiser4' aio_read file operation.
> Unfortunately, it appeared to get a loop which is very similar to the one of
> fs/read_write.c:do_loop_readv_writev().
> Alternatively, if do_loop_readv_writev were EXPORT_SYMBOL-ed
> reiser4' aio_read could use it instead. But, there is a problem with 
> do_loop_readv_writev EXPORT_SYMBOL-ing:
> one if its arguments is io_fn_t, which is declared in fs/read_write.h.
> If it is ok to move io_fn_t and do_loop_readv_writev declarations to 
> include/linux/fs.h and to EXPORT_SYMBOL 
> do_loop_readv_writev the fix will be smaller. Please, let me know what would 
> you prefer.
> 

Yes, I'd say that do_loop_readv_writev() is suitable for exporting to
filesystems, and that doing so is preferable to duplicating it.

That'd be two patches, please: one to do the export and one to use it in
reiser4.

I assume there's a good reason why reiser4 cannot use
generic_file_aio_read() or vfs_readv().  Please capture that discussion in
the changelog for the first patch, thanks.


> From: Vladimir Saveliev <[EMAIL PROTECTED]>
> 
> This patch adds implementation of aio_read file operation for reiser4.
> It is needed because in reiser4 there are files which can not be dealt
> with via generic page cache routines.
> In case of readv, reiser4 has no meaning to find out file type and to choose 
> proper
> way to read it. As result generic page cache read gets called for files which 
> can not be 
> read that way. Reiser4' aio_read method is to fix that problem. 
> 
> Signed-off-by: Vladimir Saveliev <[EMAIL PROTECTED]>
> 
> 
> 
> 
> diff -puN fs/reiser4/plugin/object.c~reiser4-add-aio_read 
> fs/reiser4/plugin/object.c
> --- linux-2.6.18-rc6-mm2/fs/reiser4/plugin/object.c~reiser4-add-aio_read  
> 2006-09-13 20:18:23.0 +0400
> +++ linux-2.6.18-rc6-mm2-vs/fs/reiser4/plugin/object.c2006-09-13 
> 20:18:23.0 +0400
> @@ -101,7 +101,7 @@ file_plugin file_plugins[LAST_FILE_PLUGI
>   .llseek = generic_file_llseek,
>   .read = read_unix_file,
>   .write = do_sync_write,
> - .aio_read = generic_file_aio_read,
> + .aio_read = aio_read_unix_file,
>   .aio_write = generic_file_aio_write,
>   .ioctl = ioctl_unix_file,
>   .mmap = mmap_unix_file,
> diff -puN fs/reiser4/plugin/file/file.c~reiser4-add-aio_read 
> fs/reiser4/plugin/file/file.c
> --- linux-2.6.18-rc6-mm2/fs/reiser4/plugin/file/file.c~reiser4-add-aio_read   
> 2006-09-13 20:18:23.0 +0400
> +++ linux-2.6.18-rc6-mm2-vs/fs/reiser4/plugin/file/file.c 2006-09-13 
> 20:52:30.0 +0400
> @@ -2011,6 +2011,54 @@ out:
> return result;
>  }
>  
> +/**
> + * aio_read_unix_file - aio_read of struct file_operations
> + * @iocb: i/o control block
> + * @iov: i/o vector
> + * @nr_segs: number of segments in the i/o vector
> + * @pos: file position to read from
> + *
> + * When it is called within reiser4 context (this happens when sys_read is
> + * reading a file built of extents) - just call generic_file_aio_read to
> + * perform read into page cache. When it is called without reiser4 context
> + * (sys_readv) - call read_unix_file for each segments of i/o vector, so that
> + * read_unix_file will be able to choose whether the file is to be read into
> + * page cache or the file is built of tail items and page cache read is not
> + * suitable for it.
> + */
> +ssize_t aio_read_unix_file(struct kiocb *iocb, const struct iovec *iov,
> +unsigned long nr_segs, loff_t pos)
> +{
> + ssize_t ret = 0;
> +
> + if (is_in_reiser4_context())
> + return generic_file_aio_read(iocb, iov, nr_segs, pos);
> +
> + while (nr_segs > 0) {
> + void __user *base;
> + size_t len;
> + ssize_t nr;
> +
> + base = iov->iov_base;
> + len = iov->iov_len;
> + iov++;
> + nr_segs--;
> +
> + nr = read_unix_file(iocb->ki_filp, base, len, &iocb->ki_pos);
> + if (nr < 0) {
> + if (!ret)
> + ret = nr;
> + break;
> + }
> + ret += nr;
> + if (nr != len)
> + break;
> + }
> +
> + return ret;
> +
> +}
> +
>  static ssize_t read_unix_file_container_tails(
>   struct file *file, char __user *buf, size_t read_amount, loff_t *off)
>  {
> diff -puN fs/reiser4/plugin/file/file.h~reiser4-add-aio_read 
> fs/reiser4/plugin/file/file.h
> --- linux-2.6.18-rc6-mm2/fs/reiser4/plugin/file/file.h~reiser4-add-aio_read   
> 2006-09-13 20:18:23.0 +0400
> +++ linux-2.6.18-rc6-mm2-vs/fs/reiser4/plugin/file/file.h  

Re: Reiser4 und LZO compression

2006-08-27 Thread Andrew Morton
On Sun, 27 Aug 2006 04:34:26 +0400
Alexey Dobriyan <[EMAIL PROTECTED]> wrote:

> The patch below is so-called reiser4 LZO compression plugin as extracted
> from 2.6.18-rc4-mm3.
> 
> I think it is an unauditable piece of shit and thus should not enter
> mainline.

Like lib/inflate.c (and this new code should arguably be in lib/).

The problem is that if we clean this up, we've diverged very much from the
upstream implementation.  So taking in fixes and features from upstream
becomes harder and more error-prone.

I'd suspect that the maturity of these utilities is such that we could
afford to turn them into kernel code in the expectation that any future
changes will be small.  But it's not a completely simple call.

(iirc the inflate code had a buffer overrun a while back, which was found
and fixed in the upstream version).




Re: [PATCH] reiserfs: eliminate minimum window size for bitmap searching

2006-08-22 Thread Andrew Morton
On Tue, 22 Aug 2006 19:46:34 -0700
Clay Barnes <[EMAIL PROTECTED]> wrote:

> Perhaps I mis-recall, but shouldn't delayed writes (or something along 
> those lines) cause a case where two files are being incrementally
> written rare?

If we did delayed allocation, yes.  But we generally don't.  (Exceptions:
XFS, reiser4, ext4, ext2 prototype circa 2001).


Re: [PATCH] reiserfs: eliminate minimum window size for bitmap searching

2006-08-22 Thread Andrew Morton

I can see that the bigalloc code as-is is pretty sad, but this is a scary
patch.  It has the potential to cause people significant performance
problems way, way ahead in the future.

For example, suppose userspace is growing two files concurrently.  It could
be that the bigalloc code would cause one file's allocation cursor to
repeatedly jump far away from the second.  ie: a beneficial side-effect. 
Without bigalloc that side-effect is lost and the two files blocks end up
all intermingled.

I don't know if that scenario is realistic, but I bet there are similar
accidental oddities which can occur as a result of this change.

But what are they?



Re: reiser4: maybe just fix bugs?

2006-08-01 Thread Andrew Morton
On Tue, 01 Aug 2006 15:24:37 +0400
"Vladimir V. Saveliev" <[EMAIL PROTECTED]> wrote:

> > >The writeout code is ugly, although that's largely due to a mismatch 
> > >between
> > >what reiser4 wants to do and what the VFS/MM expects it to do.
> 
> Yes. reiser4 writeouts atoms. Most of pages get into atoms via
> sys_write. But pages dirtied via shared mapping do not. They get into
> atoms in reiser4's writepages address space operation.

It think you mean ->writepage - reiser4 desn't implement ->writepages().

I assume you considered hooking into ->set_page_dirty() to do the
add-to-atom thing earlier on?

We'll merge mm-tracking-shared-dirty-pages.patch into 2.6.19-rc1, which
would make that approach considerably more successful, I expect. 
->set_page_dirty() is a bit awkward because it can be called under
spinlock.

Maybe comething could also be gained from the new
vm_operations_struct.page_mkwrite(), although that's less obvious...

> That is why
> reiser4_sync_inodes has two steps: on first one it calls
> generic_sync_sb_inodes to call writepages for dirty inodes to capture
> pages dirtied via shared mapping into atoms. Second step flushes atoms.
> 
> > >
> > I agree --- both with it being ugly, and that being part of why.
> > 
> > >  If it
> > >works, we can live with it, although perhaps the VFS could be made smarter.
> > >  
> > >
> > I would be curious regarding any ideas on that.  Next time I read
> > through that code, I will keep in mind that you are open to making VFS
> > changes if it improves things, and I will try to get clever somehow and
> > send it by you.  Our squalloc code though is I must say the most
> > complicated and ugliest piece of code I ever worked on for which every
> > cumulative ugliness had a substantive performance advantage requiring us
> > to keep it.  If you spare yourself from reading that, it is
> > understandable to do so.
> > 
> > >I'd say that resier4's major problem is the lack of xattrs, acls and
> > >direct-io.  That's likely to significantly limit its vendor uptake. 
> 
> xattrs is really a problem.

That's not good.  The ability to properly support SELinux is likely to be
important.


Re: [PATCH] reiserfs: fix handling of device names with /'s in them

2006-07-12 Thread Andrew Morton
On Wed, 12 Jul 2006 20:51:47 -0700
Hans Reiser <[EMAIL PROTECTED]> wrote:

> Andrew Morton wrote:
> 
> >On Wed, 12 Jul 2006 12:42:28 -0400
> >Jeff Mahoney <[EMAIL PROTECTED]> wrote:
> >
> >  
> >
> >> On systems with block devices containing slashes (virtual dasd, cciss,
> >> etc), reiserfs will fail to initialize /proc/fs/reiserfs/ due to
> >> it being interpreted as a subdirectory. The generic block device code
> >> changes the / to ! for use in the sysfs tree. This patch uses that
> >> convention.
> >>
> >>
> >
> >Isn't it a bit dumb of us to be putting slashes in the device names anyway?
> > It would be better, if poss, to alter dasd/cciss/etc and stop all these
> >s@/@[EMAIL PROTECTED] games.
> >
> >
> >  
> >
> Isn't better to ask why there is a problem with the /'s?  It would be
> bad for Linux as a design to prevent passing arbitrary tail ends of
> filenames off to arbitrary plugins of some kind.  In general, in
> namespace design, you want to allow delegating the job of
> resolving/interpreting the tail end of a file that the front end has
> identified as something that can interpret it.
> 
> Forgive me, I probably understand something wongly about procfs and this
> issue
> 

It's a question of being *practical*.  Your observations are in the realm
of the theoretical.

Software sucks, and we get along better by not provoking it.  So don't put
spaces, let alone slashes into strings which we offer to userspace.


Re: [PATCH] reiserfs: fix handling of device names with /'s in them

2006-07-12 Thread Andrew Morton
On Wed, 12 Jul 2006 12:42:28 -0400
Jeff Mahoney <[EMAIL PROTECTED]> wrote:

>  On systems with block devices containing slashes (virtual dasd, cciss,
>  etc), reiserfs will fail to initialize /proc/fs/reiserfs/ due to
>  it being interpreted as a subdirectory. The generic block device code
>  changes the / to ! for use in the sysfs tree. This patch uses that
>  convention.

Isn't it a bit dumb of us to be putting slashes in the device names anyway?
 It would be better, if poss, to alter dasd/cciss/etc and stop all these
s@/@[EMAIL PROTECTED] games.


Re: [PATCH] reiserfs:fix journaling issue regarding fsync()

2006-07-02 Thread Andrew Morton
On Mon, 03 Jul 2006 10:31:49 +0900
Hisashi Hifumi <[EMAIL PROTECTED]> wrote:

> When write() extends a file(i_size is increased) and fsync() is called,
> change of inode must be written to journaling area through fsync().
> But,currently the i_trans_id is not correctly updated when i_size
> is increased. So fsync() does not kick the journal writer.
> 
> Reiserfs_file_write() already updates the transaction when blocks are 
> allocated,
> but the case when i_size increases and new blocks are not added is not
> correctly treated.

How can i_size be increased without adding blocks?  Are you referring to a
write which remains wholly within the final block of the file?  And/or to
an expanding lseek?

And what are the user-visible consequences of this bug?

Thanks.


Re: [PATCH 03/04] reiserfs: reorganize bitmap loading functions

2006-06-19 Thread Andrew Morton
Jeff Mahoney <[EMAIL PROTECTED]> wrote:
>
> + for (i = 0; i < SB_BMAP_NR(sb); i++)
> + bitmap[i].bh = reiserfs_read_bitmap_block(sb, i);

Can return NULL.

> + /* make sure we have them all */
> + for (i = 0; i < SB_BMAP_NR(sb); i++) {
> + wait_on_buffer(bitmap[i].bh);

Will go oops.


Re: [PATCH] Fix reiserfs deadlock

2006-04-21 Thread Andrew Morton
Jan Kara <[EMAIL PROTECTED]> wrote:
>
> reiserfs_cache_default_acl() should return whether we successfully found the
> acl or not. We have to return correct value even if reiserfs_get_acl() returns
> error code and not just 0. Otherwise callers such as reiserfs_mkdir() can
> unnecessarily lock the xattrs and later functions such as reiserfs_new_inode()
> fail to notice that we have already taken the lock and try to take it again
> with obvious consequences.
> 
> Signed-off-by: Jan Kara <[EMAIL PROTECTED]>
> 
> diff -rupX /home/jack/.kerndiffexclude 
> linux-2.6.5-SLES9_SP3_BRANCH/fs/reiserfs/xattr_acl.c 
> linux-2.6.5-SLES9_SP3_BRANCH-1-reiser_xattr_fix/fs/reiserfs/xattr_acl.c
> --- linux-2.6.5-SLES9_SP3_BRANCH/fs/reiserfs/xattr_acl.c  2006-01-21 
> 03:02:06.0 +0100
> +++ linux-2.6.5-SLES9_SP3_BRANCH-1-reiser_xattr_fix/fs/reiserfs/xattr_acl.c   
> 2006-01-21 09:09:04.0 +0100
> @@ -410,8 +410,10 @@ reiserfs_cache_default_acl (struct inode
>  acl = reiserfs_get_acl (inode, ACL_TYPE_DEFAULT);
>  reiserfs_read_unlock_xattrs (inode->i_sb);
>  reiserfs_read_unlock_xattr_i (inode);
> -ret = acl ? 1 : 0;
> + if (!acl || IS_ERR(acl))
> + return 0;
>  posix_acl_release (acl);
> + ret = 1;
>  }
>  
>  return ret;

Your email client did something ghastly to this patch.  Please fix it.

I typed it in again.  While I was there I structured it a bit more nicely.

--- 25/fs/reiserfs/xattr_acl.c~fix-reiserfs-deadlockFri Apr 21 15:15:10 2006
+++ 25-akpm/fs/reiserfs/xattr_acl.c Fri Apr 21 15:17:40 2006
@@ -408,8 +408,9 @@ int reiserfs_cache_default_acl(struct in
acl = reiserfs_get_acl(inode, ACL_TYPE_DEFAULT);
reiserfs_read_unlock_xattrs(inode->i_sb);
reiserfs_read_unlock_xattr_i(inode);
-   ret = acl ? 1 : 0;
-   posix_acl_release(acl);
+   ret = (acl && !IS_ERR(acl));
+   if (ret)
+   posix_acl_release(acl);
}
 
return ret;
_



Re: [PATCH] reiserfs: reiserfs_file_write will lose error code when a 0-length write occurs w/ O_SYNC

2006-03-02 Thread Andrew Morton

Boy, lots of reiserfs things happening lately.

We presently have:

reiserfs-do-not-check-if-unsigned-0.patch   [ merged today ]
reiserfs-fix-transaction-overflowing.patch
reiserfs-handle-trans_id-overflow.patch
reiserfs-reiserfs_file_write-will-lose-error-code-when-a-0-length-write-occurs-w-o_sync.patch
reiserfs-cleanups.patch
reiserfs-use-balance_dirty_pages_ratelimited_nr-in-reiserfs_file_write.patch
reiserfs-fix-unaligned-bitmap-usage.patch

The question is, which of these are sufficiently serious-and-safe for
2.6.16?

I haven't seen any resierfs bug reports for quite some time (except for the
usual dribble of it-goes-oops-in-prints.c-when-something-went-wrong
reports).

So I'm inclined to hold off on all the above?


Re: [Fwd: Re: [PATCH] reiserfs: use balance_dirty_pages_ratelimited_nr in reiserfs_file_write]

2006-03-02 Thread Andrew Morton
Hans Reiser <[EMAIL PROTECTED]> wrote:
>
> I suspect that when someone did the search and replace when creating
> balance_dirty_pages_ratelimited_nr they failed to read the code and
> realize this code path was already effectively ratelimited.  The result
> is they made it excessively infrequent (every 1MB if ratelimit is 8) in
> its calling balance_dirty_pages.

??  There's been no change to balance_dirty_pages_ratelimited().  I merely
widened the interface a bit: introduced the new
balance_dirty_pages_ratelimited_nr() and did

static inline void
balance_dirty_pages_ratelimited(struct address_space *mapping)
{
balance_dirty_pages_ratelimited_nr(mapping, 1);
}

That being said, if reiserfs has `number of pages' in hand then yes, it
makes sense to migrate over to balance_dirty_pages_ratelimited_nr().

> If anyone has ever seen this as an actual problem on a real system, I
> would be curious to hear of it.

No, I wouldn't expect it to make much difference.

All that gunk is there just to avoid calling the expensive
get_writeback_state() once per set_page_dirty().

Inaccuracy here will introduce the possibility that we'll transiently dirty
more memory than dirty_ratio permits, but it'll only be a little bit (eight
times the amount of memory which is dirtied per 
balance_dirty_pages_ratelimited()
call).

That's a small amount of memory.  But if you have 1000 filesystems mounted
and they all do the same thing at the same time, things could get a bit
sticky.  Your patch will (greatly) reduce the possibility of even that
scenario.



Re: latest patches degrade reiser4 performance substantially

2005-09-21 Thread Andrew Morton
Hans Reiser <[EMAIL PROTECTED]> wrote:
>
> At the this time we have no idea which patch is responsible, probably in
> a day or two we'll have a patch to fix it.
> 

OK.  I assume this performance change is demonstrable in just
2.6.14-rc2+reiser4?  Beware that there are other changes in the -mm lineup
which might cause regressions.  Notably

mm-try-to-allocate-higher-order-pages-in-rmqueue_bulk.patch

and

per-task-predictive-write-throttling-1.patch
per-task-predictive-write-throttling-1-tweaks.patch




Re: + reiser4-big-update-div64-fix.patch added to -mm tree

2005-09-16 Thread Andrew Morton
"Vladimir V. Saveliev" <[EMAIL PROTECTED]> wrote:
>
> Hello
> 
> [EMAIL PROTECTED] wrote:
> > The patch titled
> > 
> >  reiser4-big-update-div64-fix
> > 
> > has been added to the -mm tree.  Its filename is
> > 
> >  reiser4-big-update-div64-fix.patch
> > 
> > 
> > From: Andrew Morton <[EMAIL PROTECTED]>
> > 
> > There's no implementation of __div64_32().   Nor should there be.
> > 
> 
> hmm, there is one in linux/lib/div64.c.
> 

It's only for supporting compiler-generated calls on 32-bit platforms:

/* Not needed on 64bit architectures */
#if BITS_PER_LONG == 32

uint32_t __div64_32(uint64_t *n, uint32_t base)



Re: List of things requested by lkml for reiser4 inclusion (to review)

2005-09-09 Thread Andrew Morton
Hans Reiser <[EMAIL PROTECTED]> wrote:
>
> ...
>
> Do I remember right that the submission
> deadline is a week from Monday for 2.6.14 inclusion?

Next week, supposedly.

But something like a brand new filesystem can go in pretty much any time,
as long as it compiles.  Because it can't break anyone's current setup.

Brand new filesystems which also require monkeying with the core VFS aren't
quite that straightforward though.

> This is supposed to be a bullet list, so I don't list here the line by
> line minor code improvements sent to us, most of which were
> incorporated, but let me take a moment to thanks those who donated them.

Guys, you should know by now not to send 1021-column emails :(

> 1. pseudo files or "" files
> 

>   disabled.  It remains a point of (extraordinary) contention as to
> whether it can be fixed, we want to keep the code around until we can
> devote proper resources into proving it can be (or until we fail to prove
> it can be and remove it).  We don't want to delay the rest of the code for
> that proof, but we still think it can be done (by several different ways of
> which we need to select one and make it work.) Let us postpone contention
> on this until the existence of a patch that cannot crash makes contention
> purposeful, shall we?

I'd prefer that unused code simply not be present in the tree, sorry.

> 
> 
> 2. dependency on 4k stack turned off
> 
>removed as requested

So it all runs OK with 4k stacks now?

> 3. remove conditional variable code, use wait queues instead.
> 
> not done.  There are times when reduced functionality aids debugging. 
> kcond is (literally) textbook code.  We don't care enough to fight much for
> it, but akpm, what is your opinion?  Will remove if akpm asks us to.

kcond is only used in a couple of places.  One looks like it could use
complete() and the other is a standard wait-for-something-to-do kernel
thread loop, which we open-code without any fuss in lots of places
(kjournald, loop, pdflush, etc).  So yes, I'd be inclined to remove kcond
please.

Also, it would be better to use the kthread API rather than open-coding
kernel_thread() calls.  If you think that reiser4 needs additional ways of
controlling kernel threads then feel free to enhance the kthread API.

> 6.  remove type safe lists and type safe hash queues.
>
> not done, it is not clear that the person asking for this represents a
> unified consensus of lkml.  Other persons instead asked that it just be
> moved out of reiser4 code into the generic kernel code, which implies they
> did not object to it.  There are many who like being type safe.  Akpm, what
> do you yourself think?

The type-unsafety of existing list_heads gives me conniptions too.  Yes,
it'd be nice to have a type-safe version available.

That being said, I don't see why such a thing cannot be a wrapper around
the existing list_head functions.  Yes, there will be some ghastly
C-templates-via-CPP stuff, best avoided by not looking at the file ;)

We should aim for a complete 1:1 relationship between list_heads and
type-safe lists.  So people know what they're called, know how they work,
etc.  We shouldn't go adding things called rx_event_list_pop_back() when
everyone has learned the existing list API.

Of course, it would have been better to do this work as a completely
separate kernel feature rather than bundling it with a filesystem.  If this
isn't a thing your team wants to take on now then yes, I'd be inclined to
switch reiser4 to list_heads.


> 
> 7. remove fs/reiser4/lib.h:/div64_32.
> 
>is being replaced by the linux one.
> 
> 8.  Remove all assertions because they clutter the code and make it hard to 
> read
> 
> 
> We think this person was not an experienced security specialist,

We think this person didn't submit a patch to remove the 124 assertions
from ext3 ;)   Keep the assertions.



Re: reiser4 merge status

2005-09-07 Thread Andrew Morton
Hans Reiser <[EMAIL PROTECTED]> wrote:
>
> Andrew, we have been working on a big patch that resolves the VFS
> layering issues.  This patch should be ready to send in any day now, we
> are debugging it.  We may not have been sending you a lot of emails, but
> we have been working away at it.

That's good news.

>  I don't understand why you think the
> review feedback was lost.

Because I never saw the condensed list of open issues.

> I will have Zam review the email history, develop the bullet list for
> you, and send it.

OK, thanks, it's important.  Lots of people spent time looking over and
discussing the code.  They don't want to have to a) go back and re-read
everything to remember what their concerns were and b) scratch through the
new code to find out if their concerns were addressed, and how they were
addressed.

So I'd view Zam's list as pretty important.


Re: reiser4 merging action list

2005-06-27 Thread Andrew Morton
Hans Reiser <[EMAIL PROTECTED]> wrote:
>
> 
> Andrew asked me to put together a list of things that need to be done
> before merging:

Thanks.

As I said to Hans, if we can get a list of bullet-point actions nailed down
and agreed to then we have an uncontroversial path to happiness and a
merge.  Let's get down and concentrate on technical specifics.

Hans, please maintain this list and republish it as we work through things.

> * VFS will dispatch directly to the method of the plugin for the
> *_operations methods.  This requires duplicating to all plugin methods
> the common code currently used by all reiser4 plugins for a given
> method.  It has the desirable side effect of making the methods more
> fully self-contained, which is somethng I had wanted two years ago and
> was a little sad to not get, and the cost of duplicating some code. 
> Since not all plugin methods are *_operations, it means we have two
> structures with duplicated data, and duplicate data that must be in sync
> at all times is classical badness in programming technique (see Codd and
> normalization). vs owns this task
> 
> * review all sparse complaints, and revise as appropriate. 
> 
> * panic and code beauty: everyone agrees that having function, file,
> and line added to reiser4_panic output hurts nothing (I hope).  Everyone
> agrees that restarting the machine without an error message seems like a
> useless option to allow.   Much else was argued, not sure if anything
> was a consensus view.  Various detail improvements were suggested by
> Pecca, and I agreed with half of them.
> 
> 
>* metafiles should be disabled until we can present code that works
> right.  Half the list thinks we cannot solve the cycles problem ever. 
> Disable metafiles and postpone problem until working code, or the
> failure to produce it, makes it possible to do more than rant at each
> other.  This is currently already done in the -mm patches, but is
> mentioned lest someone think it forgotten.
> 
>* update the locking documentation
> 

There's also the custom list, hash and debug code.  We should either

a) remove them or

b) generify them and submit as standalone works or

c) justify them as custom-to-reiser4 and leave them as-is.




Re: -mm -> 2.6.13 merge status

2005-06-23 Thread Andrew Morton
Jeff Mahoney <[EMAIL PROTECTED]> wrote:
>
> >>+   assert("nikita-955", pool != NULL);
>  > 
>  > These assertion codes are meaningless to the rest of us so please drop
>  > them.
> 
>  As someone who spends time debugging reiser3 issues, I've grown
>  accustomed to the named assertions. They make discussing a particular
>  assertion much more natural in conversation than file:line.

__FUNCTION__?


Re: reiser4 plugins

2005-06-21 Thread Andrew Morton
Hans Reiser <[EMAIL PROTECTED]> wrote:
>
> What is wrong with having an encryption plugin implemented in this
>  manner?  What is wrong with being able to have some files implemented
>  using a compression plugin, and others in the same filesystem not.
> 
>  What is wrong with having one file in the FS use a write only plugin, in
>  which the encrypion key is changed with every append in a forward but
>  not backward computable manner, and in order to read a file you must
>  either have a key that is stored on another computer or be reading what
>  was written after the moment of cracking root?
> 
>  What is wrong with having a set of critical data files use a CRC
>  checking file plugin?

I think the concern here is that this is implemented at the wrong level.

In Linux, a filesystem is some dumb thing which implements
address_space_operations, filesystem_operations, etc.

Advanced features such as those which you describe are implemented on top
of the filesystem, not within it.  reiser4 turns it all upside down.

Now, some of the features which you envision are not amenable to
above-the-fs implementations.  But some will be, and that's where we should
implement those.



Re: reiser4 read performance loss of 17% in latest -mm kernel

2004-12-30 Thread Andrew Morton
Hans Reiser <[EMAIL PROTECTED]> wrote:
>
> Just an FYI, we are still figuring it out.  andrew, you might note it in 
> release notes, especially if we don't fix it before your next release.

I'd try reverting simplified-readahead*.patch.


Re: [Fwd: [PATCH] reiser4: fix a use after free bug in reiser4_parse_options]

2004-11-22 Thread Andrew Morton
Laurent Riffard <[EMAIL PROTECTED]> wrote:
>
> This patch fix a "use after kfree" bug in reiser4_parse_options.

I'll add that to -mm.  In future, please don't allow your email client to
wordwrap patches.  Thanks.



Re: [PATCH] Fix of quota deadlock on pagelock

2004-11-22 Thread Andrew Morton
Jan Kara <[EMAIL PROTECTED]> wrote:
>
> > 
> > What prevents vfs_quota_off() from racing with unmount?
> > 
> > If you look at, say, sync_filesystems() you'll see that we take ->s_umount
> > and then test ->s_root to check that we didn't race with an unmount
> > attempt.
>   I thought that the sync_fs and stuff is guarded by the fact that quota
> code is holding references to inodes on the filesystem. And invalidate_bdev()
> should not care about the filesystem. Or do I miss something? Anyway
> adding a comment about this is really a good idea...

Holding a ref on an inode will not cause umount to block.  It'll just cause
nasty "Busy inodes after unmount" warnings.




Re: [PATCH] Fix of quota deadlock on pagelock

2004-11-19 Thread Andrew Morton
Jan Kara <[EMAIL PROTECTED]> wrote:
>
> +static ssize_t ext2_quota_write(struct super_block *sb, int type,
> + const char *data, size_t len, loff_t off)
> +{
> + struct inode *inode = sb_dqopt(sb)->files[type];
> + unsigned long blk = off >> EXT2_BLOCK_SIZE_BITS(sb);
> + int err = 0, offset = off & (sb->s_blocksize - 1), tocopy;
> + size_t towrite = len;
> + struct buffer_head tmp_bh, *bh;
> +
> + down(&inode->i_sem);
> + while (towrite > 0) {
> + tocopy = sb->s_blocksize - offset < towrite ?
> + sb->s_blocksize - offset : towrite;
> +
> + tmp_bh.b_state = 0;
> + err = ext2_get_block(inode, blk, &tmp_bh, 1);
> + if (err)
> + goto out;
> + if (offset || tocopy != EXT2_BLOCK_SIZE(sb))
> + bh = sb_bread(sb, tmp_bh.b_blocknr);
> + else
> + bh = sb_getblk(sb, tmp_bh.b_blocknr);
> + if (!bh) {
> + err = -EIO;
> + goto out;
> + }
> + memcpy(bh->b_data+offset, data, tocopy);

It is possible to mmap block devices, so we should do a flush_dcache_page()
after the memset here.

Also, we should lock the buffer.  Because a concurrent read of the blockdev
would cause the disk DMA transfer to stomp all over the data whihc you just
wrote.  Say, someone is doing `cp /dev/hda1 /dev/null' at the same time.

So my accumulated patch against this one is:


diff -puN fs/ext2/super.c~fix-of-quota-deadlock-on-pagelock-ext2-tweaks 
fs/ext2/super.c
--- 25/fs/ext2/super.c~fix-of-quota-deadlock-on-pagelock-ext2-tweaksFri Nov 
19 14:54:47 2004
+++ 25-akpm/fs/ext2/super.c Fri Nov 19 15:00:59 2004
@@ -1020,10 +1020,13 @@ static ssize_t ext2_quota_read(struct su
   size_t len, loff_t off)
 {
struct inode *inode = sb_dqopt(sb)->files[type];
-   unsigned long blk = off >> EXT2_BLOCK_SIZE_BITS(sb);
-   int err = 0, offset = off & (sb->s_blocksize - 1), tocopy;
+   sector_t blk = off >> EXT2_BLOCK_SIZE_BITS(sb);
+   int err = 0;
+   int offset = off & (sb->s_blocksize - 1);
+   int tocopy;
size_t toread;
-   struct buffer_head tmp_bh, *bh;
+   struct buffer_head tmp_bh;
+   struct buffer_head *bh;
loff_t i_size = i_size_read(inode);
 
if (off > i_size)
@@ -1061,10 +1064,13 @@ static ssize_t ext2_quota_write(struct s
const char *data, size_t len, loff_t off)
 {
struct inode *inode = sb_dqopt(sb)->files[type];
-   unsigned long blk = off >> EXT2_BLOCK_SIZE_BITS(sb);
-   int err = 0, offset = off & (sb->s_blocksize - 1), tocopy;
+   sector_t blk = off >> EXT2_BLOCK_SIZE_BITS(sb);
+   int err = 0;
+   int offset = off & (sb->s_blocksize - 1);
+   int tocopy;
size_t towrite = len;
-   struct buffer_head tmp_bh, *bh;
+   struct buffer_head tmp_bh;
+   struct buffer_head *bh;
 
down(&inode->i_sem);
while (towrite > 0) {
@@ -1083,9 +1089,12 @@ static ssize_t ext2_quota_write(struct s
err = -EIO;
goto out;
}
+   lock_buffer(bh);
memcpy(bh->b_data+offset, data, tocopy);
+   flush_dcache_page(&bh->b_page);
set_buffer_uptodate(bh);
mark_buffer_dirty(bh);
+   unlock_buffer(bh);
brelse(bh);
offset = 0;
towrite -= tocopy;
_


I suspect things in there are still a bit racy against a concurrent
`blockdev --flushbufs' though.



Re: [PATCH] Fix of quota deadlock on pagelock

2004-11-19 Thread Andrew Morton
Jan Kara <[EMAIL PROTECTED]> wrote:
>
> +static ssize_t ext2_quota_read(struct super_block *sb, int type, char *data,
> +size_t len, loff_t off)
> +{
> + struct inode *inode = sb_dqopt(sb)->files[type];
> + unsigned long blk = off >> EXT2_BLOCK_SIZE_BITS(sb);

`blk' should have sector_t type.

> + int err = 0, offset = off & (sb->s_blocksize - 1), tocopy;

Please do

int err = 0;
int offset = off & (sb->s_blocksize - 1);
int tocopy;

I'll make the above changes.


Re: [PATCH] Fix of quota deadlock on pagelock

2004-11-19 Thread Andrew Morton

What prevents vfs_quota_off() from racing with unmount?

If you look at, say, sync_filesystems() you'll see that we take ->s_umount
and then test ->s_root to check that we didn't race with an unmount
attempt.



Re:  2.6.9-mm1

2004-10-23 Thread Andrew Morton
"Hilzinger Marcel" <[EMAIL PROTECTED]> wrote:
>
> SuSE Linux 9.2 will contain reiser4

hm.  Nobody ever tells me anything.  Does that mean that
SuSE are using 8k stacks?


Re: silent semantic changes with reiser4

2004-08-26 Thread Andrew Morton
Spam <[EMAIL PROTECTED]> wrote:
>
>Secondly, do you expect file managers like Nautilus and Konqueror to
>support  every piece of file format on the planet so they could read
>information directly from the documents?

Sure.  You're proposing that we implement a single, golden multi-stream file
format in the kernel.

We could just as well do that in libmultistreamfileformat.so.

But I'll grant that one cannot go adding new metadata to, say, C files this
way.  I don't know how useful such a thing is though.

Remember that my main point is that there's a lack of coordination in
userspace.  Hell, there's none.  Putting it in-kernel forces that
coordination, and may be the way to go, but it's pretty sad.



Re: I oppose Chris and Jeff's patch to add an unnecessary additional namespace to ReiserFS

2004-04-26 Thread Andrew Morton
Hey, I was reading that!

Please do *not* go making modifications to Cc: lists.  Just do reply-to-all
and be happy, thanks.


Re: reiserfs logging patches udpated

2004-03-24 Thread Andrew Morton
Chris Mason <[EMAIL PROTECTED]> wrote:
>
> On Wed, 2004-03-24 at 17:35, Andrew Morton wrote:
> >  
> > > For data=ordered?  The only docs are to mount -o data=writeback if you
> > > don't want data=ordered (which is the new default).  No tool upgrades
> > > are required.
> > 
> > OK, thanks.  Switching the default on day one sounds radical doesn't it?
> > 
> It does, but the code has been in testing -suse and on the reiserfs
> list.  This dooesn't mean data=ordered is perfect, but it's not quite
> day one either.  I can switch the default back, but I'd rather have a
> trial by fire ;-)

OK.

> > > reiserfs needs a Documentation/filesystems/reiserfs.txt, but it needs
> > > that in general.  I'll write one up and have the namesys guys review.
> > 
> > OK.  And these guys need boring old changelogs please:
> > 
> The top of each patch has a boring old changelog.  I can reformat them
> if needed.

Oh, I didn't notice that.

Anything other than one-patch-per-email with changelog in the body is a bit
of a pain.

I'll go fetch the patches again :(


Re: reiserfs logging patches udpated

2004-03-24 Thread Andrew Morton
Chris Mason <[EMAIL PROTECTED]> wrote:
>
> On Wed, 2004-03-24 at 17:18, Andrew Morton wrote:
> > Chris Mason <[EMAIL PROTECTED]> wrote:
> > >
> > > ftp.suse.com/pub/people/mason/patches/data-logging/experimental/2.6.5-rc2-mm2
> > > 
> > > Has a new set of reiserfs patches. 
> > 
> > -ENODOCCO.  If people are going to test this stuff we will need setup and
> > usage instructions, links to userspace tool upgrades, etc, etc.
> 
> For data=ordered?  The only docs are to mount -o data=writeback if you
> don't want data=ordered (which is the new default).  No tool upgrades
> are required.

OK, thanks.  Switching the default on day one sounds radical doesn't it?

> reiserfs needs a Documentation/filesystems/reiserfs.txt, but it needs
> that in general.  I'll write one up and have the namesys guys review.

OK.  And these guys need boring old changelogs please:

reiserfs-nesting-02.patch
reiserfs-journal-writer.patch
reiserfs-logging.patch
reiserfs-jh-2.patch
reiserfs-prealloc.patch
reiserfs-tail-jh.patch
reiserfs-writepage-ordered-race.patch
reiserfs-file_write_hole_sd.diff.patch
reiserfs-laptop-mode.patch
reiserfs-truncate-leak.patch
reiserfs-ordered-lat.patch
reiserfs-dirty-warning.patch



Re: reiserfs logging patches udpated

2004-03-24 Thread Andrew Morton
Chris Mason <[EMAIL PROTECTED]> wrote:
>
> ftp.suse.com/pub/people/mason/patches/data-logging/experimental/2.6.5-rc2-mm2
> 
> Has a new set of reiserfs patches. 

-ENODOCCO.  If people are going to test this stuff we will need setup and
usage instructions, links to userspace tool upgrades, etc, etc.



Re: Filesystem Tests

2003-08-14 Thread Andrew Morton
Mike Fedyk <[EMAIL PROTECTED]> wrote:
>
> On Wed, Aug 06, 2003 at 06:34:10PM +0200, Diego Calleja Garc?a wrote:
>  > El Wed, 06 Aug 2003 18:06:37 +0400 Hans Reiser <[EMAIL PROTECTED]> escribi?:
>  > 
>  > > I don't think ext2 is a serious option for servers of the sort that 
>  > > Linux specializes in, which is probably why he didn't measure it.
>  > 
>  > Why?
> 
>  Because if you have a power outage, or a crash, you have to run the
>  filesystem check tools on it or risk damaging it further.
> 
>  Journaled filesystems have a much smaller chance of having problems after a
>  crash.

Journalled filesytems have a runtime cost, and you're paying that all the
time.

If you're going 200 days between crashes on a disk-intensive box then using
a journalling fs to save 30 minutes at reboot time just doesn't stack up:
you've lost much, much more time than that across the 200 days.

It all depends on what the machine is doing and what your max downtime
requirements are.


Re: Filesystem Tests

2003-08-05 Thread Andrew Morton
Grant Miner <[EMAIL PROTECTED]> wrote:
>
> I tested the performace of various filesystems with a mozilla build tree 
>  of 295MB, with primarily writing and copying operations.  The test 
>  system is Linux 2.6.0-test2, 512MB memory, 11531.85MB partition for 
>  tests.  Sync is run a few times throughout the test (for full script see 
>  bottom of this email).  I ran mkfs on the partition before every test. 
>  Running the tests again tends to produces similar times, about +/- 3 
>  seconds.
> 
>  The first item number is time, in seconds, to complete the test (lower 
>  is better).  The second number is CPU use percentage (lower is better).
> 
>  reiser4 171.28s, 30%CPU (1.x time; 1.0x CPU)
>  reiserfs 302.53s, 16%CPU (1.7663x time; 0.53x CPU)
>  ext3 319.71s, 11%CPU (1.8666x time; 0.36x CPU)
>  xfs 429.79s, 13%CPU (2.5093x time; 0.43x CPU)
>  jfs 470.88s, 6%CPU (2.7492x time 0.02x CPU)

But different filesystems will leave different amounts of dirty, unwritten
data in memory at the end of the test.  On your machine, up to 200MB of
dirty data could be sitting there in memory at the end of the timing
interval.  You need to decide how to account for that unwritten data in the
measurement.  Simply ignoring it as you have done is certainly valid, but
is only realistic in a couple of scenarios:

a) the files are about the be deleted again

b) the application which your benchmark simulates is about to spend more
   than 30 seconds not touching the disk.

This discrepancy is especially significant with ext3 which, in ordered data
mode, will commit all that data every five seconds.  If the test takes more
than five seconds then ext3 can appear to take a _lot_ longer.

But it is somewhat artificial: that data has to be written out sometime.

Solutions to this inaccuracy are to make the test so long-running (ten
minutes or more) that the difference is minor, or to include the `sync' in
the time measurement.


And when benching things, please include ext2.  It is the reference
filesystem, as it were.  It tends to be the fastest, too.



[reiserfs-list] Re: [patch] truncate_inode_pages

2001-06-14 Thread Andrew Morton

Dieter Nützel wrote:
> 
> Hello Andrew,
> 
> your patch is in 2.4.6-pre3 but do not appear in Alan's 2.4.5-ac14.

He'll pick it up next time he syncs with Linus.

> Is it not needed any longer since the latest VM fixes?

It's needed.  It doesn't fix any bugs - it just speeds
up some uncommon cases by few thousand times.