Re: [PATCH] LogFS take three

2007-05-17 Thread David Woodhouse
On Thu, 2007-05-17 at 15:12 +0900, Dongjun Shin wrote:
 The current trend of flash-based device is to hide the flash-specific details
 from the host OS. The flash memory is encapsulated in a package
 which contains a dedicated controller where a small piece of software (F/W or 
 FTL)
 runs and makes the storage shown as a block device to the host.

Yes. These things are almost always implemented _very_ badly by the same
kind of crack-smoking hobo they drag in off the streets to write BIOSen.

It's bog-roll technology; if you fancy a laugh try doing some real
reliability tests on them time some. Powerfail testing is a good one.
 
This kind of thing is OK for disposable storage such as in digital
cameras, where it doesn't matter that it's no more reliable than a
floppy disc, but for real long-term storage it's really a bad idea.

 IMHO, for a flash-optimized filesystem to be useful and widely-used, it would 
 be better
 to run on a block device and to be designed to run efficiently on top of the 
 FTL.
 (ex. log-structured filesystem on general block device)

There's little point in optimising a file system _specifically_ for
devices which in often aren't reliable enough to keep your data anyway.
You might as well use ramfs.

It's unfortunate really -- there's no _fundamental_ reason why FTL has
to be done so badly; it's just that it almost always is. Direct access
to the flash from Linux is _always_ going to be better in practice --
and that way you avoid the problems with dual journalling, along with
the problems with the underlying FTL continuing to keep (and copy around
during GC) sectors which the top-level filesystem has actually
deallocated, etc.

-- 
dwmw2

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] AFS: Implement shared-writable mmap [try #2]

2007-05-17 Thread Nick Piggin

David Howells wrote:

Nick Piggin [EMAIL PROTECTED] wrote:



You can drop the lock, do the invalidation,



Hmmm...  There's a danger of incurring a race by doing that.  Consider two
processes both trying to write to a dirty page for which writeback will be
rejected:

 (1) The first process gets EKEYREJECTED from the server, drops its lock and
 is then preempted.

 (2) The second process gets EKEYREJECTED from the server, drops its lock,
 truncates the page, reloads the page and modifies it.

 (3) The first process resumes and truncates the page, thereby splatting the
 second process's write.

Or:

 (1) The first process gets EKEYREJECTED from the server, clears the writeback
 information from the page, drops its lock and is then preempted.

 (2) The second process attaches its own writeback information to the page and
 modifies it.

 (3) The first process resumes and truncates the page, thereby splatting the
 second process's write.


If there are race issues involving concurrent invalidations, then you'd
fix that up by taking a filesystem specific lock to prevent them.

Generic write path should be holding i_mutex, but I don't think you can
take that from page_mkwrite... Just add one of your own.



Really, what I want to do is pass the page lock to truncate to deal with.
Better still, I want truncate to be selective, based on whether or not a page
is still associated with the rejected writeback.  I wonder if I should call
truncate_complete_page() or invalidate_complete_page() directly.


No, you shouldn't. We could theoretically introduce a new API for this,
but I think it would be preferable if you can fix the race in the fs.

--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: Software raid0 will crash the file-system, when each disk is 5TB

2007-05-17 Thread Jan Engelhardt

On May 17 2007 09:42, Jeff Zheng wrote:

Problem is that is only happens when you actually write data to the
raid. You need the actual space to reproduce the problem.

That should not be a big problem. Create like 4x950G virtual sparse
drives (takes roughly or so 4x100 MB on the host after mkfs), then
set up a raid inside the vm, with which you can play - even write
to it. Not sure how RAID5 will affect the host size of the virtual
drives, since during first resync when all disks are blank, it will
XOR it (0^0=1), and hence fills up the host disk.


Jan
-- 
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] LogFS take three

2007-05-17 Thread Dongjun Shin

On 5/17/07, David Woodhouse [EMAIL PROTECTED] wrote:


Yes. These things are almost always implemented _very_ badly by the same
kind of crack-smoking hobo they drag in off the streets to write BIOSen.

It's bog-roll technology; if you fancy a laugh try doing some real
reliability tests on them time some. Powerfail testing is a good one.

This kind of thing is OK for disposable storage such as in digital
cameras, where it doesn't matter that it's no more reliable than a
floppy disc, but for real long-term storage it's really a bad idea.



There are so many flash-based storage and some disposable storages,
as you pointed out, have poor quality. I think it's mainly because these
are not designed for good quality, but for lowering the price.

These kind of devices are not ready for things like power failure because
their use case is far from that. For example, removing flash card
while taking pictures using digital camera is not a common use case.
(there should be a written notice that this kind of action is against
the warranty)



There's little point in optimising a file system _specifically_ for
devices which in often aren't reliable enough to keep your data anyway.
You might as well use ramfs.

It's unfortunate really -- there's no _fundamental_ reason why FTL has
to be done so badly; it's just that it almost always is. Direct access
to the flash from Linux is _always_ going to be better in practice --
and that way you avoid the problems with dual journalling, along with
the problems with the underlying FTL continuing to keep (and copy around
during GC) sectors which the top-level filesystem has actually
deallocated, etc.



There are, of course, cases where direct access are better.
However, as the demand for capacity, reliability and high performance
for the flash storage increases, the use of FTL with embedded controller
would be inevitable.

- The complexity/cost of host-side SW (like JFFS2/MTD) will increase due to
the use of multiple flash in parallel and the use of high degree ECC algorithm.
There are other things like bad block handling and wear-leveling issues
which has been recently touched by UBI with added SW complexity.

- In contrast to the embedded environment where CPU and flash is directly
connected, the I/O path between CPU and flash in PC environment is longer.
The latency for SW handshaking between CPU and flash will also be longer,
which would make the performance optimization harder.

As I mentioned, some techniques like log-structured filesystem could
perform generally better on any kind of flash-based storage with FTL.
Although there are many kinds of FTL, it is commonly true that
it performs well under workload where sequential write is dominant.

I also expect that FTL for PC environment will have better quality spec
than the disposable storage.

Dongjun
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] LogFS take three

2007-05-17 Thread David Woodhouse
On Thu, 2007-05-17 at 17:20 +0900, Dongjun Shin wrote:
 There are, of course, cases where direct access are better.
 However, as the demand for capacity, reliability and high performance
 for the flash storage increases, the use of FTL with embedded controller
 would be inevitable.
 
 - The complexity/cost of host-side SW (like JFFS2/MTD) will increase due to
 the use of multiple flash in parallel and the use of high degree ECC 
 algorithm.
 There are other things like bad block handling and wear-leveling issues
 which has been recently touched by UBI with added SW complexity.

You don't get rid of that complexity by offloading it to a µcontroller. 

The only thing you achieve that way is making sure it's quite likely to
be done badly, and making it impossible to debug.

 - In contrast to the embedded environment where CPU and flash is directly
 connected, the I/O path between CPU and flash in PC environment is longer.
 The latency for SW handshaking between CPU and flash will also be longer,
 which would make the performance optimization harder.

Do it the naïve way with a single byte push/pull and waggling the
control lines separately, and what you say is true -- but you can have
flash controllers which assist with data transfer but still give you
essentially 'raw' access to the chip.

With the CAFÉ controller designed for the OLPC machine, we can spew data
across the PCI bus just as fast as we can suck it off the flash chip.

 As I mentioned, some techniques like log-structured filesystem could
 perform generally better on any kind of flash-based storage with FTL.
 Although there are many kinds of FTL, it is commonly true that
 it performs well under workload where sequential write is dominant.

Yes, it's certainly possible that we _could_ write a file system which
is specifically targeted at FTL -- I was just wondering why anyone would
_bother_ :)

I've seen an interesting file system which does have a kind of FTL
internally as its lowest layer, and which build on that using 'virtual'
sectors for file extents. Now that _does_ have its advantages -- but it
doesn't go as far as pretending to be a 'normal' block device; it's its
own special thing for internal use within that file system.
 
 I also expect that FTL for PC environment will have better quality spec
 than the disposable storage.

There really is no reason why FTL has to be done badly; just as there's
no _reason_ why hardware vendors have to give us crappy bsVendorCode.
Nevertheless, that's the way the world tends to be. So good luck
shipping with that :)

-- 
dwmw2

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] LogFS take three

2007-05-17 Thread Pavel Machek

  My experience is that no matter which name I pick, 
 people will
  complain
  anyway.  Previous suggestions included:
  jffs3
  jefs
  engelfs
  poofs
  crapfs
  sweetfs
  cutefs
  dynamic journaling fs - djofs
  tfsfkal - the file system formerly known as logfs
 
 Can we call it jörnfs? :)
 
 However if Jörn is accused of murder, it will have 
 little chance of
 being merged :-).

Nah, it would lead to Jorn disappearing misteriously and _Pavel_
accused of murder ;-).

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] LogFS take three

2007-05-17 Thread David Woodhouse
On Thu, 2007-05-17 at 09:12 +, Pavel Machek wrote:
 Nah, it would lead to Jorn disappearing misteriously and _Pavel_
 accused of murder ;-).

Are you suggesting that you would murder Jörn (you misspelled his name)
merely for the heinous crime of using his own name?

Your Luddism was already quite excessive, but now you really _are_
taking it to extremes. :)

-- 
dwmw2

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: Software raid0 will crash the file-system, when each disk is 5TB

2007-05-17 Thread Neil Brown
On Thursday May 17, [EMAIL PROTECTED] wrote:
 XOR it (0^0=1), and hence fills up the host disk.

Uhmm... you need to check your maths.

$ perl -e 'printf %d\n, 0^0;'
0

:-)
NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] LogFS take three

2007-05-17 Thread Arnd Bergmann
On Tuesday 15 May 2007, Jörn Engel wrote:
 
  I've been semi watching this, and the only comment I really can give
  is that I hate the name.  To me, logfs implies a filesystem for
  logging purposes, not for Flash hardware with wear leveling issues to
  be taken into account.
 
 Yeah, well, ...
 
 Two years ago when I started all this, I was looking for a good name.
 All I could come up with sounded stupid, so I picked LogFS as a code
 name.  As soon as I find a better name, the code name should get
 replaced.
 

When doing a google search on logfs, there are less than five results
among the first 100 that don't refer to your work. The other two listed
in there are also log-structured file systems: The inferno flash file
system (http://inferno-os.googlecode.com/svn/trunk/liblogfs/) and the
(discontinued) file system named lfs from the 2005 google summer of
code.

I'd say the name should stay, changing it now can only add more confusion.

Arnd 
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/5][TAKE3] fallocate() implementation on i86, x86_64 and powerpc

2007-05-17 Thread Dave Kleikamp
On Thu, 2007-05-17 at 09:40 +1000, David Chinner wrote:
 On Wed, May 16, 2007 at 07:21:16AM -0500, Dave Kleikamp wrote:
  On Wed, 2007-05-16 at 13:16 +1000, David Chinner wrote:

   Please don't make this always happen. c/mtime updates should be dependent
   on the mode being used and whether there is visible change to the file. 
   If no
   userspace visible changes to the file occurred, then timestamps should not
   be changed.
  
  i_blocks will be updated, so it seems reasonable to update ctime.  mtime
  shouldn't be changed, though, since the contents of the file will be
  unchanged.
 
 That's assuming blocks were actually allocated - if the prealloc range already
 has underlying blocks there is no change and so we should not be changing
 mtime either. Only the filesystem will know if it has changed the file, so I
 think that timestamp updates need to be driven down to that level, not done
 blindy at the highest layer

Yes, I agree.

Shaggy
-- 
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] LogFS take three

2007-05-17 Thread Jörn Engel
On Thu, 17 May 2007 16:43:59 +0800, David Woodhouse wrote:
 
  As I mentioned, some techniques like log-structured filesystem could
  perform generally better on any kind of flash-based storage with FTL.
  Although there are many kinds of FTL, it is commonly true that
  it performs well under workload where sequential write is dominant.
 
 Yes, it's certainly possible that we _could_ write a file system which
 is specifically targeted at FTL -- I was just wondering why anyone would
 _bother_ :)

Haven't you done that already?  JFFS2 write behaviour is the best-case
scenario for any FTL.  When the delta cache is finished, LogFS will be
pretty close to that as well.

Not sure if anyone would specifically target FTL.  Being well-suited for
those beasts is just a side-effect.

The FTL is still a net loss.  Without that FAT enabling layer a real
flash filesystem would be more efficient.

Jörn

-- 
Prosperity makes friends, adversity tries them.
-- Publilius Syrus
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/5][TAKE3] fallocate() implementation on i86, x86_64 and powerpc

2007-05-17 Thread Amit K. Arora
On Thu, May 17, 2007 at 09:40:36AM +1000, David Chinner wrote:
 On Wed, May 16, 2007 at 07:21:16AM -0500, Dave Kleikamp wrote:
  On Wed, 2007-05-16 at 13:16 +1000, David Chinner wrote:
   On Wed, May 16, 2007 at 01:33:59AM +0530, Amit K. Arora wrote:
  
Following changes were made to the previous version:
 1) Added description before sys_fallocate() definition.
 2) Return EINVAL for len=0 (With new draft that Ulrich pointed to,
posix_fallocate should return EINVAL for len = 0.
 3) Return EOPNOTSUPP if mode is not one of FA_ALLOCATE or FA_DEALLOCATE
 4) Do not return ENODEV for dirs (let individual file systems decide if
they want to support preallocation to directories or not.
 5) Check for wrap through zero.
 6) Update c/mtime if fallocate() succeeds.
   
   Please don't make this always happen. c/mtime updates should be dependent
   on the mode being used and whether there is visible change to the file. 
   If no
   userspace visible changes to the file occurred, then timestamps should not
   be changed.
  
  i_blocks will be updated, so it seems reasonable to update ctime.  mtime
  shouldn't be changed, though, since the contents of the file will be
  unchanged.
 
 That's assuming blocks were actually allocated - if the prealloc range already
 has underlying blocks there is no change and so we should not be changing
 mtime either. Only the filesystem will know if it has changed the file, so I
 think that timestamp updates need to be driven down to that level, not done
 blindy at the highest layer

Ok. Will make this change in the next post.

--
Regards,
Amit Arora
 
 Cheers,
 
 Dave.
 -- 
 Dave Chinner
 Principal Engineer
 SGI Australian Software Group
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] AFS: Implement shared-writable mmap [try #2]

2007-05-17 Thread David Howells
Nick Piggin [EMAIL PROTECTED] wrote:

 No, you shouldn't. We could theoretically introduce a new API for this,
 but I think it would be preferable if you can fix the race in the fs.

Actually, I might be able to do better.

When making a StoreData call to the AFS server, I send all the parameters
first, and at that point, the server will abort it, I think, if permission is
not available, and won't wait for the payload to be delivered.

So if I tell AF_RXRPC to send the parameter data with an explicit ACK request
and then wait till it's either hard-ACK'd or aborted, I should then be able to
deal with the permissions failure at a state where I have locked *all* the
pages to be sent.

At that point, I should be able to tell truncate to simple discard all these
locked pages.

How's that sound?

David
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] file capabilities: Introduction

2007-05-17 Thread Serge E. Hallyn
Quoting Suparna Bhattacharya ([EMAIL PROTECTED]):
 On Mon, May 14, 2007 at 08:00:11PM +, Pavel Machek wrote:
  Hi!
  
   Serge E. Hallyn [EMAIL PROTECTED] wrote:
   
Following are two patches which have been sitting for some time in -mm.
   
   Where some time == nearly six months.
   
   We need help considering, reviewing and testing this code, please.
  
  I did quick scan, and it looks ok. Plus, it means we can finally start
  using that old capabilities subsystem... so I think we should do it.
 
 FWIW, I looked through it recently as well, and it looked reasonable enough
 to me, though I'm not a security expert. I did have a question about
 testing corner cases etc, which Serge has tried to address.
 
 Serge, are you planning to post an update without STRICTXATTR ? That should
 simplify the second patch.

Sorry, I did but I guess I didn't cc: you on that reply.

It is at http://lkml.org/lkml/2007/5/14/276

thanks,
-serge
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/6][TAKE4] fallocate system call

2007-05-17 Thread Amit K. Arora
Description:
---
fallocate() is a new system call being proposed here which will allow
applications to preallocate space to any file(s) in a file system.
Each file system implementation that wants to use this feature will need
to support an inode operation called fallocate.

Applications can use this feature to avoid fragmentation to certain
level and thus get faster access speed. With preallocation, applications
also get a guarantee of space for particular file(s) - even if later the
the system becomes full.

Currently, glibc provides an interface called posix_fallocate() which
can be used for similar cause. Though this has the advantage of working
on all file systems, but it is quite slow (since it writes zeroes to
each block that has to be preallocated). Without a doubt, file systems
can do this more efficiently within the kernel, by implementing
the proposed fallocate() system call. It is expected that
posix_fallocate() will be modified to call this new system call first
and incase the kernel/filesystem does not implement it, it should fall
back to the current implementation of writing zeroes to the new blocks.

Interface:
-
The proposed system call's layout is:

 asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)

fd: The descriptor of the open file.

mode*: This specifies the behavior of the system call. Currently the
  system call supports two modes - FA_ALLOCATE and FA_DEALLOCATE.
  FA_ALLOCATE: Applications can use this mode to preallocate blocks to
a given file (specified by fd). This mode changes the file size if
the preallocation is done beyond the EOF. It also updates the
ctime in the inode of the corresponding file, marking a
successfull allocation.
  FA_DEALLOCATE: This mode can be used by applications to deallocate the
previously preallocated blocks. This also may change the file size
and the ctime/mtime.
* New modes might get added in future. One such new mode which is
  already under discussion is FA_PREALLOCATE, which when used will
  preallocate space but will not change the filesize and [cm]time.
  Since the semantics of this new mode is not clear and agreed upon yet,
  this patchset does not implement it currently.

offset: This is the offset in bytes, from where the preallocation should
  start.

len: This is the number of bytes requested for preallocation (from
  offset).
 
RETURN VALUE: The system call returns 0 on success and an error on
failure. This is done to keep the semantics same as of
posix_fallocate(). 

sys_fallocate() on s390:
---
There is a problem with s390 ABI to implement sys_fallocate() with the
proposed order of arguments. Martin Schwidefsky has suggested a patch to
solve this problem which makes use of a wrapper in the kernel. This will
require special handling of this system call on s390 in glibc as well.
But, this seems to be the best solution so far.

Known Problem:
-
mmapped writes into uninitialized extents is a known problem with the
current ext4 patches. Like XFS, ext4 may need to implement
-page_mkwrite() to solve this. See:
http://lkml.org/lkml/2007/5/8/583

Since there is a talk of -fault() replacing -page_mkwrite() and also
with a generic block_page_mkwrite() implementation already posted, we
can implement this later some time. See:
http://lkml.org/lkml/2007/3/7/161
http://lkml.org/lkml/2007/3/18/198

ToDos:
-
1 Implementation on other architectures (other than i386, x86_64,
ppc64 and s390(x)). David Chinner has already posted a patch for ia64.
2 A generic file system operation to handle fallocate
(generic_fallocate), for filesystems that do _not_ have the fallocate
inode operation implemented.
3 Changes to glibc,
   a) to support fallocate() system call
   b) to make posix_fallocate() and posix_fallocate64() call fallocate()


Changelog:
-
Changes from Take2 to Take3:
1) Return type is now described in the interface description
   above.
2) Patches rebased to 2.6.22-rc1 kernel.

** Each post will have an individual changelog for a particular patch.


Following patches follow:
Patch 1/6 : fallocate() implementation on i86, x86_64 and powerpc
Patch 2/6 : fallocate() on s390
Patch 3/6 : fallocate() on ia64
Patch 4/6 : ext4: Extent overlap bugfix
Patch 5/6 : ext4: fallocate support in ext4
Patch 6/6 : ext4: write support for preallocated blocks

--
Regards,
Amit Arora
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/6][TAKE4] fallocate() on s390

2007-05-17 Thread Amit K. Arora
This is the patch suggested by Martin Schwidefsky to support
sys_fallocate() on s390(x) platform.

He also suggested a wrapper in glibc to handle this system call on
s390. Posting it here so that we get feedback for this too.

.globl __fallocate
ENTRY(__fallocate)
stm %r6,%r7,28(%r15)/* save %r6/%r7 on stack */
cfi_offset (%r7, -68)
cfi_offset (%r6, -72)
lm  %r6,%r7,96(%r15)/* load loff_t len from stack */
svc SYS_ify(fallocate)
lm  %r6,%r7,28(%r15)/* restore %r6/%r7 from stack */
br  %r14
PSEUDO_END(__fallocate)


Here are the comments and the patch to linux kernel from him.

-
From: Martin Schwidefsky [EMAIL PROTECTED]

This patch implements support of fallocate system call on s390(x)
platform. A wrapper is added to address the issue which s390 ABI has
with the arguments of this system call.

Signed-off-by: Martin Schwidefsky [EMAIL PROTECTED]
---
 arch/s390/kernel/compat_wrapper.S |   10 ++
 arch/s390/kernel/sys_s390.c   |   29 +
 arch/s390/kernel/syscalls.S   |1 +
 include/asm-s390/unistd.h |3 ++-
 4 files changed, 42 insertions(+), 1 deletion(-)

Index: linux-2.6.22-rc1/arch/s390/kernel/compat_wrapper.S
===
--- linux-2.6.22-rc1.orig/arch/s390/kernel/compat_wrapper.S
+++ linux-2.6.22-rc1/arch/s390/kernel/compat_wrapper.S
@@ -1682,3 +1682,13 @@ compat_sys_utimes_wrapper:
llgtr   %r2,%r2 # char *
llgtr   %r3,%r3 # struct compat_timeval *
jg  compat_sys_utimes
+
+   .globl  sys_fallocate_wrapper
+sys_fallocate_wrapper:
+   lgfr%r2,%r2 # int
+   lgfr%r3,%r3 # int
+   sllg%r4,%r4,32  # get high word of 64bit loff_t
+   lr  %r4,%r5 # get low word of 64bit loff_t
+   sllg%r5,%r6,32  # get high word of 64bit loff_t
+   l   %r5,164(%r15)   # get low word of 64bit loff_t
+   jg  sys_fallocate
Index: linux-2.6.22-rc1/arch/s390/kernel/sys_s390.c
===
--- linux-2.6.22-rc1.orig/arch/s390/kernel/sys_s390.c
+++ linux-2.6.22-rc1/arch/s390/kernel/sys_s390.c
@@ -265,3 +265,32 @@ s390_fadvise64_64(struct fadvise64_64_ar
return -EFAULT;
return sys_fadvise64_64(a.fd, a.offset, a.len, a.advice);
 }
+
+#ifndef CONFIG_64BIT
+/*
+ * This is a wrapper to call sys_fallocate(). For 31 bit s390 the last
+ * 64 bit argument len is split into the upper and lower 32 bits. The
+ * system call wrapper in the user space loads the value to %r6/%r7.
+ * The code in entry.S keeps the values in %r2 - %r6 where they are and
+ * stores %r7 to 96(%r15). But the standard C linkage requires that
+ * the whole 64 bit value for len is stored on the stack and doesn't
+ * use %r6 at all. So s390_fallocate has to convert the arguments from
+ *   %r2: fd, %r3: mode, %r4/%r5: offset, %r6/96(%r15)-99(%r15): len
+ * to
+ *   %r2: fd, %r3: mode, %r4/%r5: offset, 96(%r15)-103(%r15): len
+ */
+asmlinkage long s390_fallocate(int fd, int mode, loff_t offset,
+  u32 len_high, u32 len_low)
+{
+   union {
+   u64 len;
+   struct {
+   u32 high;
+   u32 low;
+   };
+   } cv;
+   cv.high = len_high;
+   cv.low = len_low;
+   return sys_fallocate(fd, mode, offset, cv.len);
+}
+#endif
Index: linux-2.6.22-rc1/arch/s390/kernel/syscalls.S
===
--- linux-2.6.22-rc1.orig/arch/s390/kernel/syscalls.S
+++ linux-2.6.22-rc1/arch/s390/kernel/syscalls.S
@@ -322,3 +322,4 @@ NI_SYSCALL  
/* 310 sys_move_pages *
 SYSCALL(sys_getcpu,sys_getcpu,sys_getcpu_wrapper)
 SYSCALL(sys_epoll_pwait,sys_epoll_pwait,compat_sys_epoll_pwait_wrapper)
 SYSCALL(sys_utimes,sys_utimes,compat_sys_utimes_wrapper)
+SYSCALL(s390_fallocate,sys_fallocate,sys_fallocate_wrapper)
Index: linux-2.6.22-rc1/include/asm-s390/unistd.h
===
--- linux-2.6.22-rc1.orig/include/asm-s390/unistd.h
+++ linux-2.6.22-rc1/include/asm-s390/unistd.h
@@ -251,8 +251,9 @@
 #define __NR_getcpu311
 #define __NR_epoll_pwait   312
 #define __NR_utimes313
+#define __NR_fallocate 314
 
-#define NR_syscalls 314
+#define NR_syscalls 315
 
 /* 
  * There are some system calls that are not present on 64 bit, some
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/6][TAKE4] fallocate() on ia64

2007-05-17 Thread Amit K. Arora
Here is the 2.6.22-rc1 version of David's patch: add fallocate() on ia64

From: David Chinner [EMAIL PROTECTED]
Subject: [PATCH] ia64 fallocate syscall
Cc: Amit K. Arora [EMAIL PROTECTED], 
[EMAIL PROTECTED], [EMAIL PROTECTED],
[EMAIL PROTECTED], [EMAIL PROTECTED]

ia64 fallocate syscall support.

Signed-Off-By: Dave Chinner [EMAIL PROTECTED]

---
 arch/ia64/kernel/entry.S  |1 +
 include/asm-ia64/unistd.h |3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)

Index: linux-2.6.22-rc1/arch/ia64/kernel/entry.S
===
--- linux-2.6.22-rc1.orig/arch/ia64/kernel/entry.S  2007-05-12 
18:45:56.0 -0700
+++ linux-2.6.22-rc1/arch/ia64/kernel/entry.S   2007-05-15 15:36:48.0 
-0700
@@ -1585,5 +1585,6 @@
data8 sys_getcpu
data8 sys_epoll_pwait   // 1305
data8 sys_utimensat
+   data8 sys_fallocate

.org sys_call_table + 8*NR_syscalls // guard against failures to 
increase NR_syscalls
Index: linux-2.6.22-rc1/include/asm-ia64/unistd.h
===
--- linux-2.6.22-rc1.orig/include/asm-ia64/unistd.h 2007-05-12 
18:45:56.0 -0700
+++ linux-2.6.22-rc1/include/asm-ia64/unistd.h  2007-05-15 15:37:51.0 
-0700
@@ -296,6 +296,7 @@
 #define __NR_getcpu1304
 #define __NR_epoll_pwait   1305
 #define __NR_utimensat 1306
+#define __NR_fallocate 1307

 #ifdef __KERNEL__


-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 4/6][TAKE4] ext4: Extent overlap bugfix

2007-05-17 Thread Amit K. Arora
This patch adds a check for overlap of extents and cuts short the
new extent to be inserted, if there is a chance of overlap.

Changelog:
-
Changes from Take3 to Take4:
 - no change -
Changes from Take2 to Take3:
 1) Patch rebased to 2.6.22-rc1 kernel.
Changes from Take1 to Take2:
 1) As suggested by Andrew, a check for wrap though zero has been added.

Here is the new patch:

Signed-off-by: Amit Arora [EMAIL PROTECTED]
---
 fs/ext4/extents.c   |   60 ++--
 include/linux/ext4_fs_extents.h |1 
 2 files changed, 59 insertions(+), 2 deletions(-)

Index: linux-2.6.22-rc1/fs/ext4/extents.c
===
--- linux-2.6.22-rc1.orig/fs/ext4/extents.c
+++ linux-2.6.22-rc1/fs/ext4/extents.c
@@ -1128,6 +1128,55 @@ ext4_can_extents_be_merged(struct inode 
 }
 
 /*
+ * check if a portion of the newext extent overlaps with an
+ * existing extent.
+ *
+ * If there is an overlap discovered, it updates the length of the newext
+ * such that there will be no overlap, and then returns 1.
+ * If there is no overlap found, it returns 0.
+ */
+unsigned int ext4_ext_check_overlap(struct inode *inode,
+   struct ext4_extent *newext,
+   struct ext4_ext_path *path)
+{
+   unsigned long b1, b2;
+   unsigned int depth, len1;
+   unsigned int ret = 0;
+
+   b1 = le32_to_cpu(newext-ee_block);
+   len1 = le16_to_cpu(newext-ee_len);
+   depth = ext_depth(inode);
+   if (!path[depth].p_ext)
+   goto out;
+   b2 = le32_to_cpu(path[depth].p_ext-ee_block);
+
+   /*
+* get the next allocated block if the extent in the path
+* is before the requested block(s) 
+*/
+   if (b2  b1) {
+   b2 = ext4_ext_next_allocated_block(path);
+   if (b2 == EXT_MAX_BLOCK)
+   goto out;
+   }
+
+   /* check for wrap through zero */
+   if (b1 + len1  b1) {
+   len1 = EXT_MAX_BLOCK - b1;
+   newext-ee_len = cpu_to_le16(len1);
+   ret = 1;
+   }
+
+   /* check for overlap */
+   if (b1 + len1  b2) {
+   newext-ee_len = cpu_to_le16(b2 - b1);
+   ret = 1;
+   }
+out:
+   return ret;
+}
+
+/*
  * ext4_ext_insert_extent:
  * tries to merge requsted extent into the existing extent or
  * inserts requested extent as new one into the tree,
@@ -2031,7 +2080,15 @@ int ext4_ext_get_blocks(handle_t *handle
 
/* allocate new block */
goal = ext4_ext_find_goal(inode, path, iblock);
-   allocated = max_blocks;
+
+   /* Check if we can really insert (iblock)::(iblock+max_blocks) extent */
+   newex.ee_block = cpu_to_le32(iblock);
+   newex.ee_len = cpu_to_le16(max_blocks);
+   err = ext4_ext_check_overlap(inode, newex, path);
+   if (err)
+   allocated = le16_to_cpu(newex.ee_len);
+   else
+   allocated = max_blocks;
newblock = ext4_new_blocks(handle, inode, goal, allocated, err);
if (!newblock)
goto out2;
@@ -2039,7 +2096,6 @@ int ext4_ext_get_blocks(handle_t *handle
goal, newblock, allocated);
 
/* try to insert new extent into found leaf and return */
-   newex.ee_block = cpu_to_le32(iblock);
ext4_ext_store_pblock(newex, newblock);
newex.ee_len = cpu_to_le16(allocated);
err = ext4_ext_insert_extent(handle, inode, path, newex);
Index: linux-2.6.22-rc1/include/linux/ext4_fs_extents.h
===
--- linux-2.6.22-rc1.orig/include/linux/ext4_fs_extents.h
+++ linux-2.6.22-rc1/include/linux/ext4_fs_extents.h
@@ -190,6 +190,7 @@ ext4_ext_invalidate_cache(struct inode *
 
 extern int ext4_extent_tree_init(handle_t *, struct inode *);
 extern int ext4_ext_calc_credits_for_insert(struct inode *, struct 
ext4_ext_path *);
+extern unsigned int ext4_ext_check_overlap(struct inode *, struct ext4_extent 
*, struct ext4_ext_path *);
 extern int ext4_ext_insert_extent(handle_t *, struct inode *, struct 
ext4_ext_path *, struct ext4_extent *);
 extern int ext4_ext_walk_space(struct inode *, unsigned long, unsigned long, 
ext_prepare_callback, void *);
 extern struct ext4_ext_path * ext4_ext_find_extent(struct inode *, int, struct 
ext4_ext_path *);
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 5/6][TAKE4] ext4: fallocate support in ext4

2007-05-17 Thread Amit K. Arora
This patch implements -fallocate() inode operation in ext4. With this
patch users of ext4 file systems will be able to use fallocate() system
call for persistent preallocation.

Current implementation only supports preallocation for regular files
(directories not supported as of date) with extent maps. This patch
does not support block-mapped files currently.

Only FA_ALLOCATE mode is being supported as of now. Supporting
FA_DEALLOCATE mode is a ToDo item.

Changelog:
-
Changes from Take3 to Take4:
 1) Changed ext4_fllocate() declaration and definition to return a long
and not an int, to match with -fallocate() inode op.
 2) Update ctime if new blocks get allocated.
Changes from Take2 to Take3:
 1) Patch rebased to 2.6.22-rc1 kernel version.
 2) Removed unnecessary EXPORT_SYMBOL(ext4_fallocate);.
Changes from Take1 to Take2:
 1) Added more description for ext4_fallocate().
 2) Now returning EOPNOTSUPP when files are block-mapped (non-extent).
 3) Moved journal_start  journal_stop inside the while loop.
 4) Replaced BUG_ON with WARN_ON  ext4_error.
 5) Make EXT4_BLOCK_ALIGN use ALIGN macro internally.
 6) Added variable names in the function declaration of ext4_fallocate()
 7) Converted macros that handle uninitialized extents into inline
functions.

Here is the updated patch:

Signed-off-by: Amit Arora [EMAIL PROTECTED]
---
 fs/ext4/extents.c   |  249 +---
 fs/ext4/file.c  |1 
 include/linux/ext4_fs.h |8 +
 include/linux/ext4_fs_extents.h |   12 +
 4 files changed, 229 insertions(+), 41 deletions(-)

Index: linux-2.6.22-rc1/fs/ext4/extents.c
===
--- linux-2.6.22-rc1.orig/fs/ext4/extents.c
+++ linux-2.6.22-rc1/fs/ext4/extents.c
@@ -282,7 +282,7 @@ static void ext4_ext_show_path(struct in
} else if (path-p_ext) {
ext_debug(  %d:%d:%llu ,
  le32_to_cpu(path-p_ext-ee_block),
- le16_to_cpu(path-p_ext-ee_len),
+ ext4_ext_get_actual_len(path-p_ext),
  ext_pblock(path-p_ext));
} else
ext_debug(  []);
@@ -305,7 +305,7 @@ static void ext4_ext_show_leaf(struct in
 
for (i = 0; i  le16_to_cpu(eh-eh_entries); i++, ex++) {
ext_debug(%d:%d:%llu , le32_to_cpu(ex-ee_block),
- le16_to_cpu(ex-ee_len), ext_pblock(ex));
+ ext4_ext_get_actual_len(ex), ext_pblock(ex));
}
ext_debug(\n);
 }
@@ -425,7 +425,7 @@ ext4_ext_binsearch(struct inode *inode, 
ext_debug(  - %d:%llu:%d ,
le32_to_cpu(path-p_ext-ee_block),
ext_pblock(path-p_ext),
-   le16_to_cpu(path-p_ext-ee_len));
+   ext4_ext_get_actual_len(path-p_ext));
 
 #ifdef CHECK_BINSEARCH
{
@@ -686,7 +686,7 @@ static int ext4_ext_split(handle_t *hand
ext_debug(move %d:%llu:%d in new leaf %llu\n,
le32_to_cpu(path[depth].p_ext-ee_block),
ext_pblock(path[depth].p_ext),
-   le16_to_cpu(path[depth].p_ext-ee_len),
+   ext4_ext_get_actual_len(path[depth].p_ext),
newblock);
/*memmove(ex++, path[depth].p_ext++,
sizeof(struct ext4_extent));
@@ -1106,7 +1106,19 @@ static int
 ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1,
struct ext4_extent *ex2)
 {
-   if (le32_to_cpu(ex1-ee_block) + le16_to_cpu(ex1-ee_len) !=
+   unsigned short ext1_ee_len, ext2_ee_len;
+
+   /*
+* Make sure that either both extents are uninitialized, or
+* both are _not_.
+*/
+   if (ext4_ext_is_uninitialized(ex1) ^ ext4_ext_is_uninitialized(ex2))
+   return 0;
+
+   ext1_ee_len = ext4_ext_get_actual_len(ex1);
+   ext2_ee_len = ext4_ext_get_actual_len(ex2);
+
+   if (le32_to_cpu(ex1-ee_block) + ext1_ee_len !=
le32_to_cpu(ex2-ee_block))
return 0;
 
@@ -1115,14 +1127,14 @@ ext4_can_extents_be_merged(struct inode 
 * as an RO_COMPAT feature, refuse to merge to extents if
 * this can result in the top bit of ee_len being set.
 */
-   if (le16_to_cpu(ex1-ee_len) + le16_to_cpu(ex2-ee_len)  EXT_MAX_LEN)
+   if (ext1_ee_len + ext2_ee_len  EXT_MAX_LEN)
return 0;
 #ifdef AGGRESSIVE_TEST
if (le16_to_cpu(ex1-ee_len) = 4)
return 0;
 #endif
 
-   if (ext_pblock(ex1) + le16_to_cpu(ex1-ee_len) == ext_pblock(ex2))
+   if (ext_pblock(ex1) + ext1_ee_len == ext_pblock(ex2))
return 1;
return 0;
 }
@@ -1144,7 +1156,7 @@ 

[PATCH 6/6][TAKE4] ext4: write support for preallocated blocks

2007-05-17 Thread Amit K. Arora
This patch adds write support to the uninitialized extents that get
created when a preallocation is done using fallocate(). It takes care of
splitting the extents into multiple (upto three) extents and merging the
new split extents with neighbouring ones, if possible.

Changelog:
-
Changes from Take3 to Take4:
 - no change -
Changes from Take2 to Take3:
 1) Patch now rebased to 2.6.22-rc1 kernel.
Changes from Take1 to Take2:
 1) Replaced BUG_ON with WARN_ON  ext4_error.
 2) Added variable names to the function declaration of
ext4_ext_try_to_merge().
 3) Updated variable declarations to use multiple-definitions-per-line.
 4) if((a=foo())).. was broken into a=foo(); if(a)..
 5) Removed extra spaces.

Here is the updated patch:

Signed-off-by: Amit Arora [EMAIL PROTECTED]
---
 fs/ext4/extents.c   |  234 +++-
 include/linux/ext4_fs_extents.h |3 
 2 files changed, 210 insertions(+), 27 deletions(-)

Index: linux-2.6.22-rc1/fs/ext4/extents.c
===
--- linux-2.6.22-rc1.orig/fs/ext4/extents.c
+++ linux-2.6.22-rc1/fs/ext4/extents.c
@@ -1140,6 +1140,54 @@ ext4_can_extents_be_merged(struct inode 
 }
 
 /*
+ * This function tries to merge the ex extent to the next extent in the tree.
+ * It always tries to merge towards right. If you want to merge towards
+ * left, pass ex - 1 as argument instead of ex.
+ * Returns 0 if the extents (ex and ex+1) were _not_ merged and returns
+ * 1 if they got merged.
+ */
+int ext4_ext_try_to_merge(struct inode *inode,
+ struct ext4_ext_path *path,
+ struct ext4_extent *ex)
+{
+   struct ext4_extent_header *eh;
+   unsigned int depth, len;
+   int merge_done = 0;
+   int uninitialized = 0;
+
+   depth = ext_depth(inode);
+   BUG_ON(path[depth].p_hdr == NULL);
+   eh = path[depth].p_hdr;
+
+   while (ex  EXT_LAST_EXTENT(eh))
+   {
+   if (!ext4_can_extents_be_merged(inode, ex, ex + 1))
+   break;
+   /* merge with next extent! */
+   if (ext4_ext_is_uninitialized(ex))
+   uninitialized = 1;
+   ex-ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex)
+   + ext4_ext_get_actual_len(ex + 1));
+   if (uninitialized)
+   ext4_ext_mark_uninitialized(ex);
+
+   if (ex + 1  EXT_LAST_EXTENT(eh)) {
+   len = (EXT_LAST_EXTENT(eh) - ex - 1)
+   * sizeof(struct ext4_extent);
+   memmove(ex + 1, ex + 2, len);
+   }
+   eh-eh_entries = cpu_to_le16(le16_to_cpu(eh-eh_entries) - 1);
+   merge_done = 1;
+   WARN_ON(eh-eh_entries == 0);
+   if (!eh-eh_entries)
+   ext4_error(inode-i_sb, ext4_ext_try_to_merge,
+  inode#%lu, eh-eh_entries = 0!, inode-i_ino);
+   }
+
+   return merge_done;
+}
+
+/*
  * check if a portion of the newext extent overlaps with an
  * existing extent.
  *
@@ -1327,25 +1375,7 @@ has_space:
 
 merge:
/* try to merge extents to the right */
-   while (nearex  EXT_LAST_EXTENT(eh)) {
-   if (!ext4_can_extents_be_merged(inode, nearex, nearex + 1))
-   break;
-   /* merge with next extent! */
-   if (ext4_ext_is_uninitialized(nearex))
-   uninitialized = 1;
-   nearex-ee_len = cpu_to_le16(ext4_ext_get_actual_len(nearex)
-   + ext4_ext_get_actual_len(nearex + 1));
-   if (uninitialized)
-   ext4_ext_mark_uninitialized(nearex);
-
-   if (nearex + 1  EXT_LAST_EXTENT(eh)) {
-   len = (EXT_LAST_EXTENT(eh) - nearex - 1)
-   * sizeof(struct ext4_extent);
-   memmove(nearex + 1, nearex + 2, len);
-   }
-   eh-eh_entries = cpu_to_le16(le16_to_cpu(eh-eh_entries)-1);
-   BUG_ON(eh-eh_entries == 0);
-   }
+   ext4_ext_try_to_merge(inode, path, nearex);
 
/* try to merge extents to the left */
 
@@ -2011,15 +2041,152 @@ void ext4_ext_release(struct super_block
 #endif
 }
 
+/*
+ * This function is called by ext4_ext_get_blocks() if someone tries to write
+ * to an uninitialized extent. It may result in splitting the uninitialized
+ * extent into multiple extents (upto three - one initialized and two
+ * uninitialized).
+ * There are three possibilities:
+ *   a There is no split required: Entire extent should be initialized
+ *   b Splits in two extents: Write is happening at either end of the extent
+ *   c Splits in three extents: Somone is writing in middle of the extent
+ */
+int ext4_ext_convert_to_initialized(handle_t *handle, struct inode *inode,
+ 

Re: [PATCH] LogFS take three

2007-05-17 Thread Arnd Bergmann
On Tuesday 15 May 2007, Jörn Engel wrote:
 Add LogFS, a scalable flash filesystem.

Hi Jörn,

Sorry for not commenting earlier, there were so many discussions on version
two that I wanted to wait for the fallout of that instead of duplicating
all the comments.

Here are a few things I notice while going through the third version:

 +/*
 + * Private errno for accessed beyond end-of-file.  Only used internally to
 + * logfs.  If this ever gets exposed to userspace or even other parts of the
 + * kernel, it is a bug.  256 was chosen as a number sufficiently above all
 + * used errno #defines.
 + *
 + * It can be argued that this is a hack and should be replaced with something
 + * else.  My last attempt to do this failed spectacularly and there are more
 + * urgent problems that users actually care about.  This will remain for the
 + * moment.  Patches are wellcome, of course.
 + */
 +#define EOF  256

It should at least be in the kernel-only errno range between 512 and 4095,
that way it can eventually be added to include/linux/errno.h.

 + * Target rename works in three atomic steps:
 + * 1. Attach old inode to new dentry (remember old dentry and new inode)
 + * 2. Remove old dentry (still remember the new inode)
 + * 3. Remove new inode
 + *
 + * Here we remember both an inode an a dentry.  If we get interrupted
 + * between steps 1 and 2, we delete both the dentry and the inode.  If
 + * we get interrupted between steps 2 and 3, we delete just the inode.
 + * In either case, the remaining objects are deleted on next mount.  From
 + * a users point of view, the operation succeeded.

This description had me confused for a while: why would you remove the
new inode. Maybe change the text to say 'target inode' or 'victim inode'?

 +static int logfs_mkdir(struct inode *dir, struct dentry *dentry, int mode)
 +{
 + struct inode *inode;
 +
 + if (dir-i_nlink = LOGFS_LINK_MAX)
 + return -EMLINK;

Why is i_nlink limited? Don't you run out of space for inodes before
overflowing?

 + * In principle, this function should loop forever, looking for GC candidates
 + * and moving data.  LogFS is designed in such a way that this loop is
 + * guaranteed to terminate.
 + *
 + * Limiting the loop to four iterations serves purely to catch cases when
 + * these guarantees have failed.  An actual endless loop is an obvious bug
 + * and should be reported as such.
 + *
 + * But there is another nasty twist to this.  As I have described in my LCA
 + * presentation, Garbage collection would have to limit itself to higher
 + * levels if the number of available free segments goes down.  This code
 + * doesn't and should fail spectacularly.  Yet - hard as I tried I haven't
 + * been able to make it fail (short of a bug elsewhere).
 + *
 + * So in a way this code is intentionally wrong as a desperate cry for a
 + * better testcase.  And I do expect to get blamed for it one day. :(
 + */

Could you bug the code to reserve fewer segments for GC than you really
need, in order to stress test GC?

 +static struct inode *logfs_alloc_inode(struct super_block *sb)
 +{
 + struct logfs_inode *li;
 +
 + li = kmem_cache_alloc(logfs_inode_cache, GFP_KERNEL);
 + if (!li)
 + return NULL;
 + logfs_init_inode(li-vfs_inode);
 + return li-vfs_inode;
 +}
 +
 +
 +struct inode *logfs_new_meta_inode(struct super_block *sb, u64 ino)
 +{
 + struct inode *inode;
 +
 + inode = logfs_alloc_inode(sb);
 + if (!inode)
 + return ERR_PTR(-ENOMEM);
 +
 + logfs_init_inode(inode);

logfs_alloc_inode() returns an initialized inode, so no need to call
logfs_init_inode() again, right?

 +static __be64 timespec_to_be64(struct timespec tsp)
 +{
 + u64 time = ((u64)tsp.tv_sec  32) + (tsp.tv_nsec  0x);
 +
 + WARN_ON(tsp.tv_nsec  9);
 + return cpu_to_be64(time);
 +}

Why not just store 64 bit nanoseconds? that would avoid the problem
with ns overflow and the year-2038 bug. OTOH, that would require
a 64 bit integer division when reading the data, so it gets you
a runtime overhead.

 +static void logfs_read_inode(struct inode *inode)
 +{
 + int ret;
 +
 + BUG_ON(inode-i_ino == LOGFS_INO_MASTER);
 +
 + ret = __logfs_read_inode(inode);
 +
 + /* What else can we do here? */
 + BUG_ON(ret);
 +}

ext2 returns make_bad_inode(inode) in this case, which seems to be
a better solution than crashing.

 +int __logfs_write_inode(struct inode *inode)
 +{
 + /*
 +  * FIXME: Those two inodes are 512 bytes in total.  Not good to
 +  * have on the stack.  Possibly the best solution would be to bite
 +  * the bullet and do another format change before release and
 +  * shrink the inodes.
 +  */
 + struct logfs_disk_inode old, new;
 +
 + BUG_ON(inode-i_ino == LOGFS_INO_MASTER);
 +
 + /* read and compare the inode first.  If it hasn't changed, don't
 +  * bother writing it. */
 + logfs_inode_to_disk(inode, new);
 + if 

RE: Software raid0 will crash the file-system, when each disk is 5TB

2007-05-17 Thread Jan Engelhardt

On May 17 2007 21:11, Neil Brown wrote:
On Thursday May 17, [EMAIL PROTECTED] wrote:
 XOR it (0^0=1), and hence fills up the host disk.

Uhmm... you need to check your maths.

$ perl -e 'printf %d\n, 0^0;'
0

:-)

(ouch)
You know just as I that ^ is the power operator!
I just... wrongly named it XOR :p

$ echo '0^0' | bc -l
1


Well, right, setting up a blank raid5 array inside vmware will not make
the host file significantly larger, making it easy to build megatera
arrays with gigabyte range host disks.


Jan
-- 
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Review status (Re: [PATCH] LogFS take three)

2007-05-17 Thread Evgeniy Polyakov

Hi Jörn.

Is logfs 32bit fs or 674bit, since although you use 64bit values for
offsets, area management and strange converstions like described below 
from offset into segment number are performed in 32bit?
Is it enough for SSD for example to be 32bit only? Or if it is 64bit,
could you please explain logic behind area management?

I've found that you store segment numbers as 32bit values (for example
in prepare_write()), and convert requested 64bit offset into segment
number via superblock's s_segshift.
This conversation seems confusing to me in case of real 64bit offsets.
For example this one obtained via prepare_write:

7  1 logfs_prepare_write78  fs/logfs/file.c
8  2 logfs_readpage_nolock20  fs/logfs/file.c
9  1 logfs_read_block   351  fs/logfs/readwrite.c
10  1 logfs_read_loop   139  fs/logfs/readwrite.c
11  2 logfs_segment_read   108  fs/logfs/readwrite.c
12  1 wbuf_read 289 

u32 segno = ofs  super-s_segshift;

ofs is originally obtained from inode's li_data array, which is filled
with raw segment numbers which can be 64bit (here is another issue,
since logfs_segment_write() returns signed, so essentially logfs is
63bit filesystem).

But here I've came to area management in logfs, and found that it is
32bit only, for example __logfs_segment_write()/__logfs_get_free_bytes() 
returns signed 32 bit value (so it is reduced to 31 bit), which is then 
placed into li_data as 64bit value. The latter
(__logfs_get_free_bytes()) truncates 64bit data value obtained via
dev_ofs() into signed 32 bit value.

-- 
Evgeniy Polyakov
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Review status (Re: [PATCH] LogFS take three)

2007-05-17 Thread Jörn Engel
On Thu, 17 May 2007 20:03:11 +0400, Evgeniy Polyakov wrote:
 
 Is logfs 32bit fs or 674bit, since although you use 64bit values for
 offsets, area management and strange converstions like described below 
 from offset into segment number are performed in 32bit?
 Is it enough for SSD for example to be 32bit only? Or if it is 64bit,
 could you please explain logic behind area management?

Ignoring bugs and signed return values for error handling, it is either
64bit or 32+32bit.

Inode numbers and file positions are 64bit.  Offsets are 64bit as well.
In a couple of places, offsets are also 32+32bit.  Basically the high
bits contain the segment number, the lower bits the offset within a
segment.

Side note: It would be nicer if the high 32bit were segment number.
Instead the number of bits depends on segment size.  Guess I should
change that while the format isn't fixed yet.

An area is a segment that is currently being written.  Data is
appended to this segment as it comes in, until the segment is full.  Any
functions dealing with areas only need a 32bit offset, which is the
offset within the area, not the absolute device offset.

Writes within an area are also buffered.  New data first goes into the
write buffer (wbuf) and only when this is full is it flushed to the
device.  NAND flash and some NOR flashes require such buffering.  When
writing to the device, the 32bit segno and the 32bit in-segment offset
need to get converted back to a 64bit device offset.

 I've found that you store segment numbers as 32bit values (for example
 in prepare_write()), and convert requested 64bit offset into segment
 number via superblock's s_segshift.

Yes, as described above.

 This conversation seems confusing to me in case of real 64bit offsets.
 For example this one obtained via prepare_write:
 
 7  1 logfs_prepare_write78  fs/logfs/file.c
 8  2 logfs_readpage_nolock20  fs/logfs/file.c
 9  1 logfs_read_block   351  fs/logfs/readwrite.c
 10  1 logfs_read_loop   139  fs/logfs/readwrite.c
 11  2 logfs_segment_read   108  fs/logfs/readwrite.c
 12  1 wbuf_read 289 
 
 u32 segno = ofs  super-s_segshift;
 
 ofs is originally obtained from inode's li_data array, which is filled
 with raw segment numbers which can be 64bit (here is another issue,
 since logfs_segment_write() returns signed, so essentially logfs is
 63bit filesystem).

The filesystem format is 64bit.  The current code can only deal with
63bit.  Eric Sandeen just fixed ext2 to actually deal with 32bit
numbers and the same is possible for logfs.  If anyone ever cares...

 But here I've came to area management in logfs, and found that it is
 32bit only, for example __logfs_segment_write()/__logfs_get_free_bytes() 
 returns signed 32 bit value (so it is reduced to 31 bit), which is then 
 placed into li_data as 64bit value. The latter
 (__logfs_get_free_bytes()) truncates 64bit data value obtained via
 dev_ofs() into signed 32 bit value.

That indeed is a bug.  __logfs_get_free_bytes() should return s64
instead of s32.  Will fix immediatly.

If anyone can find similar bugs, the bounty is a beer or non-alcoholic
beverage of choice. :)

Jörn

-- 
To announce that there must be no criticism of the President, or that we
are to stand by the President, right or wrong, is not only unpatriotic
and servile, but is morally treasonable to the American public.
-- Theodore Roosevelt, Kansas City Star, 1918
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] LogFS take three

2007-05-17 Thread Jan Engelhardt

On May 16 2007 13:09, Jörn Engel wrote:
On Wed, 16 May 2007 12:54:14 +0800, David Woodhouse wrote:
 
 Personally I'd just go for 'JFFS3'. After all, it has a better claim to
 the name than either of its predecessors :)

Did you ever see akpm's facial expression when he tried to pronounce
JFFS2?  ;)

Is there something special with [dʒeɪ ɛf ɛf ɛs tuː]?


Jan
-- 
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] LogFS take three

2007-05-17 Thread Jan Engelhardt

On May 16 2007 14:55, Jörn Engel wrote:
On Wed, 16 May 2007 16:29:22 +0400, Evgeniy Polyakov wrote:
 On Wed, May 16, 2007 at 01:50:03PM +0200, Jörn Engel ([EMAIL PROTECTED]) 
 wrote:
  On Wed, 16 May 2007 12:34:34 +0100, Jamie Lokier wrote:
   
   But if akpm can't pronounce it, how about FFFS for faster flash
   filesystem ;-)
  
  How many of you have worked for IBM before?  Vowels are not evil. ;)
 
 Do you think 'eieio' is a good set? IBM's work too...

C'mon, UIO does not cut IIO either ;-)


Jan
-- 
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] LogFS take three

2007-05-17 Thread Jan Engelhardt

On May 16 2007 22:06, CaT wrote:
On Wed, May 16, 2007 at 01:50:03PM +0200, J??rn Engel wrote:
 On Wed, 16 May 2007 12:34:34 +0100, Jamie Lokier wrote:
  
  But if akpm can't pronounce it, how about FFFS for faster flash
  filesystem ;-)
 
 How many of you have worked for IBM before?  Vowels are not evil. ;)
 
 Grouping four or more consonants to name anything will cause similar
 expressions on people's faces.  Numbers don't help much either.
 
 Ext2 is a great name, because ext actually is a pronouncable syllable.
 MinixFS, ChunkFS, TileFS are great too.  XFS and JFS are ok, at least
 they only have three consonants.  But FFS exists, so I'd rather go for a
 syllable.

FlashFS?

Or just try once dropping all those redundant 'fs' suffixes.
bdev, proc, cpuset, devpts, mqueue, fuse(blk|ctl), vfat, iso9660, etc.
Then there's much more space for innovative names.


Jan
-- 
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] LogFS take three

2007-05-17 Thread Jan Engelhardt

On May 16 2007 15:53, Jörn Engel wrote:

My experience is that no matter which name I pick, people will complain
anyway.  Previous suggestions included:
[...]

Plus today:
FFFS
flashfs
fredfs
bob
shizzle

Imo they all suck.  LogFS also sucks, but it allows me to make a stupid
joke and keep my logfs.org domain.

Try woodfs! (log - wood - get it?)
But finding names can be so tiresome, just give it a Borg-style
designation - filesystem 125 or so. fs2007q1, being this
quartal's new filesystem.


Jan
-- 
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] LogFS take three

2007-05-17 Thread Jan Engelhardt

On May 16 2007 02:06, Jörn Engel wrote:

  +/* memtree.c */
  +void btree_init(struct btree_head *head);
  +void *btree_lookup(struct btree_head *head, long val);
  +int btree_insert(struct btree_head *head, long val, void *ptr);
  +int btree_remove(struct btree_head *head, long val);
 
 These names are too generic.  If we later add a btree library: blam.

My plan was to move this code to lib/ sooner or later.  If you consider
it useful in its current state, I can do it immediatly.  And if someone
else merged a superior btree library I'd happily remove mine and use the
new one instead.

Opinions?

Why would we need another btree, when there is lib/rbtree.c?
Or does yours do something fundamentally different?


Jan
-- 
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] LogFS take three

2007-05-17 Thread Evgeniy Polyakov
On Thu, May 17, 2007 at 07:26:07PM +0200, Jan Engelhardt ([EMAIL PROTECTED]) 
wrote:
 My plan was to move this code to lib/ sooner or later.  If you consider
 it useful in its current state, I can do it immediatly.  And if someone
 else merged a superior btree library I'd happily remove mine and use the
 new one instead.
 
 Opinions?
 
 Why would we need another btree, when there is lib/rbtree.c?
 Or does yours do something fundamentally different?

It is not red-black tree, it is b+ tree.

   Jan

-- 
Evgeniy Polyakov
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] LogFS take three

2007-05-17 Thread Jörn Engel
On Thu, 17 May 2007 17:08:51 +0200, Arnd Bergmann wrote:
 On Tuesday 15 May 2007, Jörn Engel wrote:
  Add LogFS, a scalable flash filesystem.
 
 Sorry for not commenting earlier, there were so many discussions on version
 two that I wanted to wait for the fallout of that instead of duplicating
 all the comments.

You are the last person that has to be sorry. ;)

 Here are a few things I notice while going through the third version:
 
  +/*
  + * Private errno for accessed beyond end-of-file.  Only used internally to
  + * logfs.  If this ever gets exposed to userspace or even other parts of 
  the
  + * kernel, it is a bug.  256 was chosen as a number sufficiently above all
  + * used errno #defines.
  + *
  + * It can be argued that this is a hack and should be replaced with 
  something
  + * else.  My last attempt to do this failed spectacularly and there are 
  more
  + * urgent problems that users actually care about.  This will remain for 
  the
  + * moment.  Patches are wellcome, of course.
  + */
  +#define EOF256
 
 It should at least be in the kernel-only errno range between 512 and 4095,
 that way it can eventually be added to include/linux/errno.h.

Fair enough.  512 it is.

  + * Target rename works in three atomic steps:
  + * 1. Attach old inode to new dentry (remember old dentry and new inode)
  + * 2. Remove old dentry (still remember the new inode)
  + * 3. Remove new inode
  + *
  + * Here we remember both an inode an a dentry.  If we get interrupted
  + * between steps 1 and 2, we delete both the dentry and the inode.  If
  + * we get interrupted between steps 2 and 3, we delete just the inode.
  + * In either case, the remaining objects are deleted on next mount.  From
  + * a users point of view, the operation succeeded.
 
 This description had me confused for a while: why would you remove the
 new inode. Maybe change the text to say 'target inode' or 'victim inode'?

'Victim inode' sounds good.  Will do.

  +static int logfs_mkdir(struct inode *dir, struct dentry *dentry, int mode)
  +{
  +   struct inode *inode;
  +
  +   if (dir-i_nlink = LOGFS_LINK_MAX)
  +   return -EMLINK;
 
 Why is i_nlink limited? Don't you run out of space for inodes before
 overflowing?

I don't know.  With the current limit of 2^31, a sufficiently large
device can reach the limit.  And it is imaginable that overflowing the
s32 number space can expose security holes.  Not that I actually know,
the check is pure paranoia.

  + * In principle, this function should loop forever, looking for GC 
  candidates
  + * and moving data.  LogFS is designed in such a way that this loop is
  + * guaranteed to terminate.
  + *
  + * Limiting the loop to four iterations serves purely to catch cases when
  + * these guarantees have failed.  An actual endless loop is an obvious bug
  + * and should be reported as such.
  + *
  + * But there is another nasty twist to this.  As I have described in my LCA
  + * presentation, Garbage collection would have to limit itself to higher
  + * levels if the number of available free segments goes down.  This code
  + * doesn't and should fail spectacularly.  Yet - hard as I tried I haven't
  + * been able to make it fail (short of a bug elsewhere).
  + *
  + * So in a way this code is intentionally wrong as a desperate cry for a
  + * better testcase.  And I do expect to get blamed for it one day. :(
  + */
 
 Could you bug the code to reserve fewer segments for GC than you really
 need, in order to stress test GC?

I could.  Wear leveling will cause changes in the area, so I'll have a
closer look when implementing that.

  +static struct inode *logfs_alloc_inode(struct super_block *sb)
  +{
  +   struct logfs_inode *li;
  +
  +   li = kmem_cache_alloc(logfs_inode_cache, GFP_KERNEL);
  +   if (!li)
  +   return NULL;
  +   logfs_init_inode(li-vfs_inode);
  +   return li-vfs_inode;
  +}
  +
  +
  +struct inode *logfs_new_meta_inode(struct super_block *sb, u64 ino)
  +{
  +   struct inode *inode;
  +
  +   inode = logfs_alloc_inode(sb);
  +   if (!inode)
  +   return ERR_PTR(-ENOMEM);
  +
  +   logfs_init_inode(inode);
 
 logfs_alloc_inode() returns an initialized inode, so no need to call
 logfs_init_inode() again, right?

Right.  Will change.

  +static __be64 timespec_to_be64(struct timespec tsp)
  +{
  +   u64 time = ((u64)tsp.tv_sec  32) + (tsp.tv_nsec  0x);
  +
  +   WARN_ON(tsp.tv_nsec  9);
  +   return cpu_to_be64(time);
  +}
 
 Why not just store 64 bit nanoseconds? that would avoid the problem
 with ns overflow and the year-2038 bug. OTOH, that would require
 a 64 bit integer division when reading the data, so it gets you
 a runtime overhead.

I like the idea.  Do conversion function exist both way?

What I don't get is the year-2038 bug.  Isn't that the 31bit limit,
while 32bit would last to 2106?

  +static void logfs_read_inode(struct inode *inode)
  +{
  +   int ret;
  +
  +   BUG_ON(inode-i_ino == LOGFS_INO_MASTER);
  +
  +  

Re: [PATCH] LogFS take three

2007-05-17 Thread Pavel Machek
Hi!

 Yes. These things are almost always implemented _very_ 
 badly by the same
 kind of crack-smoking hobo they drag in off the streets 
 to write BIOSen.
 
 It's bog-roll technology; if you fancy a laugh try 
 doing some real
 reliability tests on them time some. Powerfail testing 
 is a good one.
 
 This kind of thing is OK for disposable storage such as 
 in digital
 cameras, where it doesn't matter that it's no more 
 reliable than a
 floppy disc, but for real long-term storage it's really 
 a bad idea.
 
 
 There are so many flash-based storage and some 
 disposable storages,
 as you pointed out, have poor quality. I think it's 
 mainly because these
 are not designed for good quality, but for lowering the 
 price.
 
 These kind of devices are not ready for things like 
 power failure because
 their use case is far from that. For example, removing 
 flash card
 while taking pictures using digital camera is not a 
 common use case.
 (there should be a written notice that this kind of 
 action is against
 the warranty)

Hmm.. so operating your camera on batteries should be against the
warranty, since batteries commonly run empty while storing pictures?


Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] LogFS take three

2007-05-17 Thread Pekka Enberg

Jörn Engel wrote:

Compressing random data will actually enlarge it.  If that happens I
simply store the verbatim uncompressed data instead and mark it as such.

There is also demand for a user-controlled bit in the inode to disable
compression completely.  All those .jpg, .mpg, .mp3, etc. just waste
time by trying and failing to compress them.


So any sane way to enable compression is on per-inode basis which makes 
me still wonder why you need per-object compression.

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] LogFS take three

2007-05-17 Thread Arnd Bergmann
On Thursday 17 May 2007, Jörn Engel wrote:
 
  Why not just store 64 bit nanoseconds? that would avoid the problem
  with ns overflow and the year-2038 bug. OTOH, that would require
  a 64 bit integer division when reading the data, so it gets you
  a runtime overhead.
 
 I like the idea.  Do conversion function exist both way?
 
 What I don't get is the year-2038 bug.  Isn't that the 31bit limit,
 while 32bit would last to 2106?

You're right, you don't hit the 2038 bug here, because you use an
unsigned variable. The bug exists elsewhere because time_t tv_sec
is signed.

Just using nanoseconds probably doesn't gain you much after all
then. You could however just have separate 32 bit fields in the
inode for seconds and nanoseconds, that will result in the exact
same layout that you have right now, but won't require a conversion
function.

Arnd 
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] LogFS take three

2007-05-17 Thread Jörn Engel
On Thu, 17 May 2007 23:00:20 +0200, Arnd Bergmann wrote:
 
 Just using nanoseconds probably doesn't gain you much after all
 then. You could however just have separate 32 bit fields in the
 inode for seconds and nanoseconds, that will result in the exact
 same layout that you have right now, but won't require a conversion
 function.

I could also have a 30bit and a 34bit field.  30bit is enough for
nanoseconds.  So many options.

Jörn

-- 
Time? What's that? Time is only worth what you do with it.
-- Theo de Raadt
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] LogFS take three

2007-05-17 Thread Arnd Bergmann
On Thursday 17 May 2007, Pekka Enberg wrote:
 
 Jörn Engel wrote:
  Compressing random data will actually enlarge it.  If that happens I
  simply store the verbatim uncompressed data instead and mark it as such.
  
  There is also demand for a user-controlled bit in the inode to disable
  compression completely.  All those .jpg, .mpg, .mp3, etc. just waste
  time by trying and failing to compress them.
 
 So any sane way to enable compression is on per-inode basis which makes 
 me still wonder why you need per-object compression.

1. it doesn't require user interaction, the file system will do the right
thing most of the time.

2. enlarging data is a very bad thing because it makes the behaviour
of the fs unpredictable. With uncompressed objects, you have a guaranteed
upper bound on the size.

Arnd 
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] LogFS take three

2007-05-17 Thread Jörn Engel
On Thu, 17 May 2007 23:36:13 +0200, Arnd Bergmann wrote:
 On Thursday 17 May 2007, Pekka Enberg wrote:
  
  So any sane way to enable compression is on per-inode basis which makes 
  me still wonder why you need per-object compression.
 
 1. it doesn't require user interaction, the file system will do the right
 thing most of the time.
 
 2. enlarging data is a very bad thing because it makes the behaviour
 of the fs unpredictable. With uncompressed objects, you have a guaranteed
 upper bound on the size.

Correct.  The compression decision is always per-object.  Per-inode is a
hint from userspace that a compression attempt would be futile.

A compression algorithm that compresses any data is provably impossible.
Some data will always cause expansion instead of compression.  Some
algorithms have a well-known upper bound on the expansion, others don't.
So LogFS instead creates its own upper bound by reserving one byte in
the header for the compression type.

And while one bit would suffice as a compressed/uncompressed flag,
having a byte allows to support more than one compression algorithm.
LZO looks promising and is on its way into the kernel.  Others may come
in the future.

Jörn

-- 
My second remark is that our intellectual powers are rather geared to
master static relations and that our powers to visualize processes
evolving in time are relatively poorly developed.
-- Edsger W. Dijkstra
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: Software raid0 will crash the file-system, when each disk is 5TB

2007-05-17 Thread Jeff Zheng
 Fix confirmed, filled the whole 11T hard disk, without crashing.
I presume this would go into 2.6.22

Thanks again.

Jeff

 -Original Message-
 From: [EMAIL PROTECTED] 
 [mailto:[EMAIL PROTECTED] On Behalf Of Jeff Zheng
 Sent: Thursday, 17 May 2007 5:39 p.m.
 To: Neil Brown; [EMAIL PROTECTED]; Michal Piotrowski; Ingo 
 Molnar; [EMAIL PROTECTED]; 
 [EMAIL PROTECTED]; linux-fsdevel@vger.kernel.org
 Subject: RE: Software raid0 will crash the file-system, when 
 each disk is 5TB
 
 
 Yeah, seems you've locked it down, :D. I've written 600GB of 
 data now, and anything is still fine.
 Will let it run overnight, and fill the whole 11T. I'll post 
 the result tomorrow
 
 Thanks a lot though.
 
 Jeff 
 
  -Original Message-
  From: Neil Brown [mailto:[EMAIL PROTECTED]
  Sent: Thursday, 17 May 2007 5:31 p.m.
  To: [EMAIL PROTECTED]; Jeff Zheng; Michal Piotrowski; Ingo Molnar; 
  [EMAIL PROTECTED]; [EMAIL PROTECTED]; 
  linux-fsdevel@vger.kernel.org
  Subject: RE: Software raid0 will crash the file-system, 
 when each disk 
  is 5TB
  
  On Thursday May 17, [EMAIL PROTECTED] wrote:
   
   Uhm, I just noticed something.
   'chunk' is unsigned long, and when it gets shifted up, we
  might lose
   bits.  That could still happen with the 4*2.75T 
 arrangement, but is 
   much more likely in the 2*5.5T arrangement.
  
  Actually, it cannot be a problem with the 4*2.75T arrangement.
chuck  chunksize_bits
  
  will not exceed the size of the underlying device *in*kilobytes*.
  In that case that is 0xAE9EC800 which will git in a 32bit long.
  We don't double it to make sectors until after we add
  zone-dev_offset, which is sector_t and so 64bit 
 arithmetic is used.
  
  So I'm quite certain this bug will cause exactly the problems 
  experienced!!
  
   
   Jeff, can you try this patch?
  
  Don't bother about the other tests I mentioned, just try this one.
  Thanks.
  
  NeilBrown
  
   Signed-off-by: Neil Brown [EMAIL PROTECTED]
   
   ### Diffstat output
./drivers/md/raid0.c |2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
   
   diff .prev/drivers/md/raid0.c ./drivers/md/raid0.c
   --- .prev/drivers/md/raid0.c  2007-05-17 
  10:33:30.0 +1000
   +++ ./drivers/md/raid0.c  2007-05-17 15:02:15.0 +1000
   @@ -475,7 +475,7 @@ static int raid0_make_request (request_q
 x = block  chunksize_bits;
 tmp_dev = zone-dev[sector_div(x, zone-nb_dev)];
 }
   - rsect = (((chunk  chunksize_bits) + zone-dev_offset)1)
   + rsect = sector_t)chunk  chunksize_bits) +
   +zone-dev_offset)1)
 + sect_in_chunk;
 
 bio-bi_bdev = tmp_dev-bdev;
  
 -
 To unsubscribe from this list: send the line unsubscribe 
 linux-raid in the body of a message to 
 [EMAIL PROTECTED] More majordomo info at  
 http://vger.kernel.org/majordomo-info.html
 
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] LogFS take three

2007-05-17 Thread Jamie Lokier
Jörn Engel wrote:
  Almost all your static functions start with logfs_, why not this one?
 
 Because after a while I discovered how silly it is to start every
 function with logfs_.  That prefix doesn't add much unless the function
 has global scope.  What I didn't do was remove the prefix from older
 functions.

It's handy when debugging or showing detailed backtraces.  Not that
I'm advocating it (or not), just something I've noticed in other
programs.

-- Jamie
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] LogFS take three

2007-05-17 Thread Dongjun Shin

Hi,

On 5/18/07, Pavel Machek [EMAIL PROTECTED] wrote:

Hi!


Hmm.. so operating your camera on batteries should be against the
warranty, since batteries commonly run empty while storing pictures?




AFAIK, the camera stops writing to the flash card and automatically
turns off when it's low on battery (before empty).
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: Software raid0 will crash the file-system, when each disk is 5TB

2007-05-17 Thread Neil Brown
On Friday May 18, [EMAIL PROTECTED] wrote:
  Fix confirmed, filled the whole 11T hard disk, without crashing.
 I presume this would go into 2.6.22

Yes, and probably 2.6.21.y, though the patch will be slightly
different, see below.
 
 Thanks again.

And thank-you for pursuing this with me.

NeilBrown


---
Avoid overflow in raid0 calculation with large components.

If a raid0 has a component device larger than 4TB, and is accessed on
a 32bit machines, then as 'chunk' is unsigned lock,
   chunk  chunksize_bits
can overflow (this can be as high as the size of the device in KB).
chunk itself will not overflow (without triggering a BUG).

So change 'chunk' to be 'sector_t, and get rid of the 'BUG' as it becomes
impossible to hit.

Cc: Jeff Zheng [EMAIL PROTECTED]
Signed-off-by: Neil Brown [EMAIL PROTECTED]

### Diffstat output
 ./drivers/md/raid0.c |3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff .prev/drivers/md/raid0.c ./drivers/md/raid0.c
--- .prev/drivers/md/raid0.c2007-05-17 10:33:30.0 +1000
+++ ./drivers/md/raid0.c2007-05-17 16:14:12.0 +1000
@@ -415,7 +415,7 @@ static int raid0_make_request (request_q
raid0_conf_t *conf = mddev_to_conf(mddev);
struct strip_zone *zone;
mdk_rdev_t *tmp_dev;
-   unsigned long chunk;
+   sector_t chunk;
sector_t block, rsect;
const int rw = bio_data_dir(bio);
 
@@ -470,7 +470,6 @@ static int raid0_make_request (request_q
 
sector_div(x, zone-nb_dev);
chunk = x;
-   BUG_ON(x != (sector_t)chunk);
 
x = block  chunksize_bits;
tmp_dev = zone-dev[sector_div(x, zone-nb_dev)];
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] LogFS take three

2007-05-17 Thread Kyle Moffett

On May 17, 2007, at 13:45:33, Evgeniy Polyakov wrote:
On Thu, May 17, 2007 at 07:26:07PM +0200, Jan Engelhardt  
([EMAIL PROTECTED]) wrote:
My plan was to move this code to lib/ sooner or later.  If you  
consider it useful in its current state, I can do it immediatly.   
And if someone else merged a superior btree library I'd happily  
remove mine and use the new one instead.


Opinions?


Why would we need another btree, when there is lib/rbtree.c?  Or  
does yours do something fundamentally different?


It is not red-black tree, it is b+ tree.


It might be better to use the prefix bptree to help prevent  
confusion.  A quick google search on bp-tree reveals only the perl B 
+-tree module Tree::BPTree, a U-Maryland Java CS project on B+- 
trees, and a news article about a BP tree-top protest.


Cheers,
Kyle Moffett

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] AFS: Implement shared-writable mmap [try #2]

2007-05-17 Thread Nick Piggin

David Howells wrote:

Nick Piggin [EMAIL PROTECTED] wrote:



No, you shouldn't. We could theoretically introduce a new API for this,
but I think it would be preferable if you can fix the race in the fs.



Actually, I might be able to do better.

When making a StoreData call to the AFS server, I send all the parameters
first, and at that point, the server will abort it, I think, if permission is
not available, and won't wait for the payload to be delivered.

So if I tell AF_RXRPC to send the parameter data with an explicit ACK request
and then wait till it's either hard-ACK'd or aborted, I should then be able to
deal with the permissions failure at a state where I have locked *all* the
pages to be sent.

At that point, I should be able to tell truncate to simple discard all these
locked pages.

How's that sound?


Truncate as it stands still needs to be given unlocked pages. So we would
either have to create a new API, or I think preferably it would be nice if
you could see if you can first solve it with a private lock?

--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html