Re: [PATCH] LogFS take three

2007-05-17 Thread David Woodhouse
On Fri, 2007-05-18 at 08:17 +0200, Jan Engelhardt wrote:
> > AFAIK, the camera stops writing to the flash card and automatically
> > turns off when it's low on battery (before empty).
> 
> But then, one should also consider the case where a cam is connected to
> AC and someone inadvertently trips on the power cord. 

So you stick a bloody great cap on board to give you enough time to shut
it down cleanly. I've known people do this -- and it helps, but the
devices still manage to crap themselves occasionally even then.

They're _disposable_. As are your data :)

-- 
dwmw2

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] LogFS take three

2007-05-17 Thread Jan Engelhardt

On May 18 2007 09:01, Dongjun Shin wrote:
> On 5/18/07, Pavel Machek <[EMAIL PROTECTED]> wrote:
>> 
>> Hmm.. so operating your camera on batteries should be against the
>> warranty, since batteries commonly run empty while storing pictures?
>
> AFAIK, the camera stops writing to the flash card and automatically
> turns off when it's low on battery (before empty).

But then, one should also consider the case where a cam is connected to
AC and someone inadvertently trips on the power cord.


Jan
-- 
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] LogFS take three

2007-05-17 Thread Jan Engelhardt

On May 17 2007 21:00, Kyle Moffett wrote:
>> > > Opinions?
>> > 
>> > Why would we need another btree, when there is lib/rbtree.c?  Or does
>> > yours do something fundamentally different?
>> 
>> It is not red-black tree, it is b+ tree.
>
> It might be better to use the prefix "bptree" to help prevent confusion.  A
> quick google search on "bp-tree" reveals only the perl B+-tree module
> "Tree::BPTree", a U-Maryland Java CS project on B+-trees, and a news article
> about a "BP tree-top protest".

BP heh.. How about "struct bplustree"?


Jan
-- 
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] AFS: Implement shared-writable mmap [try #2]

2007-05-17 Thread Nick Piggin

David Howells wrote:

Nick Piggin <[EMAIL PROTECTED]> wrote:



No, you shouldn't. We could theoretically introduce a new API for this,
but I think it would be preferable if you can fix the race in the fs.



Actually, I might be able to do better.

When making a StoreData call to the AFS server, I send all the parameters
first, and at that point, the server will abort it, I think, if permission is
not available, and won't wait for the payload to be delivered.

So if I tell AF_RXRPC to send the parameter data with an explicit ACK request
and then wait till it's either hard-ACK'd or aborted, I should then be able to
deal with the permissions failure at a state where I have locked *all* the
pages to be sent.

At that point, I should be able to tell truncate to simple discard all these
locked pages.

How's that sound?


Truncate as it stands still needs to be given unlocked pages. So we would
either have to create a new API, or I think preferably it would be nice if
you could see if you can first solve it with a private lock?

--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] LogFS take three

2007-05-17 Thread Kyle Moffett

On May 17, 2007, at 13:45:33, Evgeniy Polyakov wrote:
On Thu, May 17, 2007 at 07:26:07PM +0200, Jan Engelhardt  
([EMAIL PROTECTED]) wrote:
My plan was to move this code to lib/ sooner or later.  If you  
consider it useful in its current state, I can do it immediatly.   
And if someone else merged a superior btree library I'd happily  
remove mine and use the new one instead.


Opinions?


Why would we need another btree, when there is lib/rbtree.c?  Or  
does yours do something fundamentally different?


It is not red-black tree, it is b+ tree.


It might be better to use the prefix "bptree" to help prevent  
confusion.  A quick google search on "bp-tree" reveals only the perl B 
+-tree module "Tree::BPTree", a U-Maryland Java CS project on B+- 
trees, and a news article about a "BP tree-top protest".


Cheers,
Kyle Moffett

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: Software raid0 will crash the file-system, when each disk is 5TB

2007-05-17 Thread Neil Brown
On Friday May 18, [EMAIL PROTECTED] wrote:
>  Fix confirmed, filled the whole 11T hard disk, without crashing.
> I presume this would go into 2.6.22

Yes, and probably 2.6.21.y, though the patch will be slightly
different, see below.
> 
> Thanks again.

And thank-you for pursuing this with me.

NeilBrown


---
Avoid overflow in raid0 calculation with large components.

If a raid0 has a component device larger than 4TB, and is accessed on
a 32bit machines, then as 'chunk' is unsigned lock,
   chunk << chunksize_bits
can overflow (this can be as high as the size of the device in KB).
chunk itself will not overflow (without triggering a BUG).

So change 'chunk' to be 'sector_t, and get rid of the 'BUG' as it becomes
impossible to hit.

Cc: "Jeff Zheng" <[EMAIL PROTECTED]>
Signed-off-by: Neil Brown <[EMAIL PROTECTED]>

### Diffstat output
 ./drivers/md/raid0.c |3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff .prev/drivers/md/raid0.c ./drivers/md/raid0.c
--- .prev/drivers/md/raid0.c2007-05-17 10:33:30.0 +1000
+++ ./drivers/md/raid0.c2007-05-17 16:14:12.0 +1000
@@ -415,7 +415,7 @@ static int raid0_make_request (request_q
raid0_conf_t *conf = mddev_to_conf(mddev);
struct strip_zone *zone;
mdk_rdev_t *tmp_dev;
-   unsigned long chunk;
+   sector_t chunk;
sector_t block, rsect;
const int rw = bio_data_dir(bio);
 
@@ -470,7 +470,6 @@ static int raid0_make_request (request_q
 
sector_div(x, zone->nb_dev);
chunk = x;
-   BUG_ON(x != (sector_t)chunk);
 
x = block >> chunksize_bits;
tmp_dev = zone->dev[sector_div(x, zone->nb_dev)];
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] LogFS take three

2007-05-17 Thread Dongjun Shin

Hi,

On 5/18/07, Pavel Machek <[EMAIL PROTECTED]> wrote:

Hi!


Hmm.. so operating your camera on batteries should be against the
warranty, since batteries commonly run empty while storing pictures?




AFAIK, the camera stops writing to the flash card and automatically
turns off when it's low on battery (before empty).
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] LogFS take three

2007-05-17 Thread Jamie Lokier
Jörn Engel wrote:
> > Almost all your static functions start with logfs_, why not this one?
> 
> Because after a while I discovered how silly it is to start every
> function with logfs_.  That prefix doesn't add much unless the function
> has global scope.  What I didn't do was remove the prefix from older
> functions.

It's handy when debugging or showing detailed backtraces.  Not that
I'm advocating it (or not), just something I've noticed in other
programs.

-- Jamie
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: Software raid0 will crash the file-system, when each disk is 5TB

2007-05-17 Thread Jeff Zheng
 Fix confirmed, filled the whole 11T hard disk, without crashing.
I presume this would go into 2.6.22

Thanks again.

Jeff

> -Original Message-
> From: [EMAIL PROTECTED] 
> [mailto:[EMAIL PROTECTED] On Behalf Of Jeff Zheng
> Sent: Thursday, 17 May 2007 5:39 p.m.
> To: Neil Brown; [EMAIL PROTECTED]; Michal Piotrowski; Ingo 
> Molnar; [EMAIL PROTECTED]; 
> [EMAIL PROTECTED]; linux-fsdevel@vger.kernel.org
> Subject: RE: Software raid0 will crash the file-system, when 
> each disk is 5TB
> 
> 
> Yeah, seems you've locked it down, :D. I've written 600GB of 
> data now, and anything is still fine.
> Will let it run overnight, and fill the whole 11T. I'll post 
> the result tomorrow
> 
> Thanks a lot though.
> 
> Jeff 
> 
> > -Original Message-
> > From: Neil Brown [mailto:[EMAIL PROTECTED]
> > Sent: Thursday, 17 May 2007 5:31 p.m.
> > To: [EMAIL PROTECTED]; Jeff Zheng; Michal Piotrowski; Ingo Molnar; 
> > [EMAIL PROTECTED]; [EMAIL PROTECTED]; 
> > linux-fsdevel@vger.kernel.org
> > Subject: RE: Software raid0 will crash the file-system, 
> when each disk 
> > is 5TB
> > 
> > On Thursday May 17, [EMAIL PROTECTED] wrote:
> > > 
> > > Uhm, I just noticed something.
> > > 'chunk' is unsigned long, and when it gets shifted up, we
> > might lose
> > > bits.  That could still happen with the 4*2.75T 
> arrangement, but is 
> > > much more likely in the 2*5.5T arrangement.
> > 
> > Actually, it cannot be a problem with the 4*2.75T arrangement.
> >   chuck << chunksize_bits
> > 
> > will not exceed the size of the underlying device *in*kilobytes*.
> > In that case that is 0xAE9EC800 which will git in a 32bit long.
> > We don't double it to make sectors until after we add
> > zone->dev_offset, which is "sector_t" and so 64bit 
> arithmetic is used.
> > 
> > So I'm quite certain this bug will cause exactly the problems 
> > experienced!!
> > 
> > > 
> > > Jeff, can you try this patch?
> > 
> > Don't bother about the other tests I mentioned, just try this one.
> > Thanks.
> > 
> > NeilBrown
> > 
> > > Signed-off-by: Neil Brown <[EMAIL PROTECTED]>
> > > 
> > > ### Diffstat output
> > >  ./drivers/md/raid0.c |2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff .prev/drivers/md/raid0.c ./drivers/md/raid0.c
> > > --- .prev/drivers/md/raid0.c  2007-05-17 
> > 10:33:30.0 +1000
> > > +++ ./drivers/md/raid0.c  2007-05-17 15:02:15.0 +1000
> > > @@ -475,7 +475,7 @@ static int raid0_make_request (request_q
> > >   x = block >> chunksize_bits;
> > >   tmp_dev = zone->dev[sector_div(x, zone->nb_dev)];
> > >   }
> > > - rsect = (((chunk << chunksize_bits) + zone->dev_offset)<<1)
> > > + rsect = sector_t)chunk << chunksize_bits) +
> > > +zone->dev_offset)<<1)
> > >   + sect_in_chunk;
> > >   
> > >   bio->bi_bdev = tmp_dev->bdev;
> > 
> -
> To unsubscribe from this list: send the line "unsubscribe 
> linux-raid" in the body of a message to 
> [EMAIL PROTECTED] More majordomo info at  
> http://vger.kernel.org/majordomo-info.html
> 
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] LogFS take three

2007-05-17 Thread Jörn Engel
On Thu, 17 May 2007 23:36:13 +0200, Arnd Bergmann wrote:
> On Thursday 17 May 2007, Pekka Enberg wrote:
> > 
> > So any sane way to enable compression is on per-inode basis which makes 
> > me still wonder why you need per-object compression.
> 
> 1. it doesn't require user interaction, the file system will do the right
> thing most of the time.
> 
> 2. enlarging data is a very bad thing because it makes the behaviour
> of the fs unpredictable. With uncompressed objects, you have a guaranteed
> upper bound on the size.

Correct.  The compression decision is always per-object.  Per-inode is a
hint from userspace that a compression attempt would be futile.

A compression algorithm that compresses any data is provably impossible.
Some data will always cause expansion instead of compression.  Some
algorithms have a well-known upper bound on the expansion, others don't.
So LogFS instead creates its own upper bound by reserving one byte in
the header for the compression type.

And while one bit would suffice as a compressed/uncompressed flag,
having a byte allows to support more than one compression algorithm.
LZO looks promising and is on its way into the kernel.  Others may come
in the future.

Jörn

-- 
My second remark is that our intellectual powers are rather geared to
master static relations and that our powers to visualize processes
evolving in time are relatively poorly developed.
-- Edsger W. Dijkstra
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] LogFS take three

2007-05-17 Thread Jörn Engel
On Thu, 17 May 2007 23:00:20 +0200, Arnd Bergmann wrote:
> 
> Just using nanoseconds probably doesn't gain you much after all
> then. You could however just have separate 32 bit fields in the
> inode for seconds and nanoseconds, that will result in the exact
> same layout that you have right now, but won't require a conversion
> function.

I could also have a 30bit and a 34bit field.  30bit is enough for
nanoseconds.  So many options.

Jörn

-- 
Time? What's that? Time is only worth what you do with it.
-- Theo de Raadt
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] LogFS take three

2007-05-17 Thread Arnd Bergmann
On Thursday 17 May 2007, Pekka Enberg wrote:
> 
> Jörn Engel wrote:
> > Compressing random data will actually enlarge it.  If that happens I
> > simply store the verbatim uncompressed data instead and mark it as such.
> > 
> > There is also demand for a user-controlled bit in the inode to disable
> > compression completely.  All those .jpg, .mpg, .mp3, etc. just waste
> > time by trying and failing to compress them.
> 
> So any sane way to enable compression is on per-inode basis which makes 
> me still wonder why you need per-object compression.

1. it doesn't require user interaction, the file system will do the right
thing most of the time.

2. enlarging data is a very bad thing because it makes the behaviour
of the fs unpredictable. With uncompressed objects, you have a guaranteed
upper bound on the size.

Arnd <><
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] LogFS take three

2007-05-17 Thread Arnd Bergmann
On Thursday 17 May 2007, Jörn Engel wrote:
> 
> > Why not just store 64 bit nanoseconds? that would avoid the problem
> > with ns overflow and the year-2038 bug. OTOH, that would require
> > a 64 bit integer division when reading the data, so it gets you
> > a runtime overhead.
> 
> I like the idea.  Do conversion function exist both way?
> 
> What I don't get is the year-2038 bug.  Isn't that the 31bit limit,
> while 32bit would last to 2106?

You're right, you don't hit the 2038 bug here, because you use an
unsigned variable. The bug exists elsewhere because time_t tv_sec
is signed.

Just using nanoseconds probably doesn't gain you much after all
then. You could however just have separate 32 bit fields in the
inode for seconds and nanoseconds, that will result in the exact
same layout that you have right now, but won't require a conversion
function.

Arnd <><
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] LogFS take three

2007-05-17 Thread Pekka Enberg

Jörn Engel wrote:

Compressing random data will actually enlarge it.  If that happens I
simply store the verbatim uncompressed data instead and mark it as such.

There is also demand for a user-controlled bit in the inode to disable
compression completely.  All those .jpg, .mpg, .mp3, etc. just waste
time by trying and failing to compress them.


So any sane way to enable compression is on per-inode basis which makes 
me still wonder why you need per-object compression.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] LogFS take three

2007-05-17 Thread Pavel Machek
Hi!

> >Yes. These things are almost always implemented _very_ 
> >badly by the same
> >kind of crack-smoking hobo they drag in off the streets 
> >to write BIOSen.
> >
> >It's bog-roll technology; if you fancy a laugh try 
> >doing some real
> >reliability tests on them time some. Powerfail testing 
> >is a good one.
> >
> >This kind of thing is OK for disposable storage such as 
> >in digital
> >cameras, where it doesn't matter that it's no more 
> >reliable than a
> >floppy disc, but for real long-term storage it's really 
> >a bad idea.
> >
> 
> There are so many flash-based storage and some 
> disposable storages,
> as you pointed out, have poor quality. I think it's 
> mainly because these
> are not designed for good quality, but for lowering the 
> price.
> 
> These kind of devices are not ready for things like 
> power failure because
> their use case is far from that. For example, removing 
> flash card
> while taking pictures using digital camera is not a 
> common use case.
> (there should be a written notice that this kind of 
> action is against
> the warranty)

Hmm.. so operating your camera on batteries should be against the
warranty, since batteries commonly run empty while storing pictures?


Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] LogFS take three

2007-05-17 Thread Jörn Engel
On Thu, 17 May 2007 17:08:51 +0200, Arnd Bergmann wrote:
> On Tuesday 15 May 2007, Jörn Engel wrote:
> > Add LogFS, a scalable flash filesystem.
> 
> Sorry for not commenting earlier, there were so many discussions on version
> two that I wanted to wait for the fallout of that instead of duplicating
> all the comments.

You are the last person that has to be sorry. ;)

> Here are a few things I notice while going through the third version:
> 
> > +/*
> > + * Private errno for accessed beyond end-of-file.  Only used internally to
> > + * logfs.  If this ever gets exposed to userspace or even other parts of 
> > the
> > + * kernel, it is a bug.  256 was chosen as a number sufficiently above all
> > + * used errno #defines.
> > + *
> > + * It can be argued that this is a hack and should be replaced with 
> > something
> > + * else.  My last attempt to do this failed spectacularly and there are 
> > more
> > + * urgent problems that users actually care about.  This will remain for 
> > the
> > + * moment.  Patches are wellcome, of course.
> > + */
> > +#define EOF256
> 
> It should at least be in the kernel-only errno range between 512 and 4095,
> that way it can eventually be added to include/linux/errno.h.

Fair enough.  512 it is.

> > + * Target rename works in three atomic steps:
> > + * 1. Attach old inode to new dentry (remember old dentry and new inode)
> > + * 2. Remove old dentry (still remember the new inode)
> > + * 3. Remove new inode
> > + *
> > + * Here we remember both an inode an a dentry.  If we get interrupted
> > + * between steps 1 and 2, we delete both the dentry and the inode.  If
> > + * we get interrupted between steps 2 and 3, we delete just the inode.
> > + * In either case, the remaining objects are deleted on next mount.  From
> > + * a users point of view, the operation succeeded.
> 
> This description had me confused for a while: why would you remove the
> new inode. Maybe change the text to say 'target inode' or 'victim inode'?

'Victim inode' sounds good.  Will do.

> > +static int logfs_mkdir(struct inode *dir, struct dentry *dentry, int mode)
> > +{
> > +   struct inode *inode;
> > +
> > +   if (dir->i_nlink >= LOGFS_LINK_MAX)
> > +   return -EMLINK;
> 
> Why is i_nlink limited? Don't you run out of space for inodes before
> overflowing?

I don't know.  With the current limit of 2^31, a sufficiently large
device can reach the limit.  And it is imaginable that overflowing the
s32 number space can expose security holes.  Not that I actually know,
the check is pure paranoia.

> > + * In principle, this function should loop forever, looking for GC 
> > candidates
> > + * and moving data.  LogFS is designed in such a way that this loop is
> > + * guaranteed to terminate.
> > + *
> > + * Limiting the loop to four iterations serves purely to catch cases when
> > + * these guarantees have failed.  An actual endless loop is an obvious bug
> > + * and should be reported as such.
> > + *
> > + * But there is another nasty twist to this.  As I have described in my LCA
> > + * presentation, Garbage collection would have to limit itself to higher
> > + * levels if the number of available free segments goes down.  This code
> > + * doesn't and should fail spectacularly.  Yet - hard as I tried I haven't
> > + * been able to make it fail (short of a bug elsewhere).
> > + *
> > + * So in a way this code is intentionally wrong as a desperate cry for a
> > + * better testcase.  And I do expect to get blamed for it one day. :(
> > + */
> 
> Could you bug the code to reserve fewer segments for GC than you really
> need, in order to stress test GC?

I could.  Wear leveling will cause changes in the area, so I'll have a
closer look when implementing that.

> > +static struct inode *logfs_alloc_inode(struct super_block *sb)
> > +{
> > +   struct logfs_inode *li;
> > +
> > +   li = kmem_cache_alloc(logfs_inode_cache, GFP_KERNEL);
> > +   if (!li)
> > +   return NULL;
> > +   logfs_init_inode(&li->vfs_inode);
> > +   return &li->vfs_inode;
> > +}
> > +
> > +
> > +struct inode *logfs_new_meta_inode(struct super_block *sb, u64 ino)
> > +{
> > +   struct inode *inode;
> > +
> > +   inode = logfs_alloc_inode(sb);
> > +   if (!inode)
> > +   return ERR_PTR(-ENOMEM);
> > +
> > +   logfs_init_inode(inode);
> 
> logfs_alloc_inode() returns an initialized inode, so no need to call
> logfs_init_inode() again, right?

Right.  Will change.

> > +static __be64 timespec_to_be64(struct timespec tsp)
> > +{
> > +   u64 time = ((u64)tsp.tv_sec << 32) + (tsp.tv_nsec & 0x);
> > +
> > +   WARN_ON(tsp.tv_nsec > 9);
> > +   return cpu_to_be64(time);
> > +}
> 
> Why not just store 64 bit nanoseconds? that would avoid the problem
> with ns overflow and the year-2038 bug. OTOH, that would require
> a 64 bit integer division when reading the data, so it gets you
> a runtime overhead.

I like the idea.  Do conversion function exist both way?

What I don't get is the year-2038 bug. 

Re: [PATCH] AFS: Implement shared-writable mmap [try #2]

2007-05-17 Thread David Howells
David Howells <[EMAIL PROTECTED]> wrote:

> When making a StoreData call to the AFS server, I send all the parameters
> first, and at that point, the server will abort it, I think, if permission is
> not available, and won't wait for the payload to be delivered.

Make that "should abort it".  openafs-1.4.2 seems to let you execute a
StoreData, even if you don't have permission to do so.  I can tell I don't
have permission to do so because the reply from the StoreData op says so...

David
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] LogFS take three

2007-05-17 Thread Evgeniy Polyakov
On Thu, May 17, 2007 at 07:26:07PM +0200, Jan Engelhardt ([EMAIL PROTECTED]) 
wrote:
> >My plan was to move this code to lib/ sooner or later.  If you consider
> >it useful in its current state, I can do it immediatly.  And if someone
> >else merged a superior btree library I'd happily remove mine and use the
> >new one instead.
> >
> >Opinions?
> 
> Why would we need another btree, when there is lib/rbtree.c?
> Or does yours do something fundamentally different?

It is not red-black tree, it is b+ tree.

>   Jan

-- 
Evgeniy Polyakov
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] LogFS take three

2007-05-17 Thread Jan Engelhardt

On May 16 2007 02:06, Jörn Engel wrote:
>
>> > +/* memtree.c */
>> > +void btree_init(struct btree_head *head);
>> > +void *btree_lookup(struct btree_head *head, long val);
>> > +int btree_insert(struct btree_head *head, long val, void *ptr);
>> > +int btree_remove(struct btree_head *head, long val);
>> 
>> These names are too generic.  If we later add a btree library: blam.
>
>My plan was to move this code to lib/ sooner or later.  If you consider
>it useful in its current state, I can do it immediatly.  And if someone
>else merged a superior btree library I'd happily remove mine and use the
>new one instead.
>
>Opinions?

Why would we need another btree, when there is lib/rbtree.c?
Or does yours do something fundamentally different?


Jan
-- 
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] LogFS take three

2007-05-17 Thread Jan Engelhardt

On May 16 2007 15:53, Jörn Engel wrote:
>
>My experience is that no matter which name I pick, people will complain
>anyway.  Previous suggestions included:
[...]
>
>Plus today:
>FFFS
>flashfs
>fredfs
>bob
>shizzle
>
>Imo they all suck.  LogFS also sucks, but it allows me to make a stupid
>joke and keep my logfs.org domain.

Try woodfs! (log - wood - get it?)
But finding names can be so tiresome, just give it a Borg-style
designation - "filesystem 125" or so. fs2007q1, being this
quartal's new filesystem.


Jan
-- 
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] LogFS take three

2007-05-17 Thread Jan Engelhardt

On May 16 2007 22:06, CaT wrote:
>On Wed, May 16, 2007 at 01:50:03PM +0200, J??rn Engel wrote:
>> On Wed, 16 May 2007 12:34:34 +0100, Jamie Lokier wrote:
>> > 
>> > But if akpm can't pronounce it, how about FFFS for faster flash
>> > filesystem ;-)
>> 
>> How many of you have worked for IBM before?  Vowels are not evil. ;)
>> 
>> Grouping four or more consonants to name anything will cause similar
>> expressions on people's faces.  Numbers don't help much either.
>> 
>> Ext2 is a great name, because "ext" actually is a pronouncable syllable.
>> MinixFS, ChunkFS, TileFS are great too.  XFS and JFS are ok, at least
>> they only have three consonants.  But FFS exists, so I'd rather go for a
>> syllable.
>
>FlashFS?

Or just try once dropping all those redundant 'fs' suffixes.
bdev, proc, cpuset, devpts, mqueue, fuse(blk|ctl), vfat, iso9660, etc.
Then there's much more space for innovative names.


Jan
-- 
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] LogFS take three

2007-05-17 Thread Jan Engelhardt

On May 16 2007 14:55, Jörn Engel wrote:
>On Wed, 16 May 2007 16:29:22 +0400, Evgeniy Polyakov wrote:
>> On Wed, May 16, 2007 at 01:50:03PM +0200, Jörn Engel ([EMAIL PROTECTED]) 
>> wrote:
>> > On Wed, 16 May 2007 12:34:34 +0100, Jamie Lokier wrote:
>> > > 
>> > > But if akpm can't pronounce it, how about FFFS for faster flash
>> > > filesystem ;-)
>> > 
>> > How many of you have worked for IBM before?  Vowels are not evil. ;)
>> 
>> Do you think 'eieio' is a good set? IBM's work too...

C'mon, UIO does not cut IIO either ;-)


Jan
-- 
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] LogFS take three

2007-05-17 Thread Jan Engelhardt

On May 16 2007 13:09, Jörn Engel wrote:
>On Wed, 16 May 2007 12:54:14 +0800, David Woodhouse wrote:
>> 
>> Personally I'd just go for 'JFFS3'. After all, it has a better claim to
>> the name than either of its predecessors :)
>
>Did you ever see akpm's facial expression when he tried to pronounce
>"JFFS2"?  ;)

Is there something special with [dʒeɪ ɛf ɛf ɛs tuː]?


Jan
-- 
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Review status (Re: [PATCH] LogFS take three)

2007-05-17 Thread Jörn Engel
On Thu, 17 May 2007 20:03:11 +0400, Evgeniy Polyakov wrote:
> 
> Is logfs 32bit fs or 674bit, since although you use 64bit values for
> offsets, area management and strange converstions like described below 
> from offset into segment number are performed in 32bit?
> Is it enough for SSD for example to be 32bit only? Or if it is 64bit,
> could you please explain logic behind area management?

Ignoring bugs and signed return values for error handling, it is either
64bit or 32+32bit.

Inode numbers and file positions are 64bit.  Offsets are 64bit as well.
In a couple of places, offsets are also 32+32bit.  Basically the high
bits contain the segment number, the lower bits the offset within a
segment.

Side note: It would be nicer if the high 32bit were segment number.
Instead the number of bits depends on segment size.  Guess I should
change that while the format isn't fixed yet.

An "area" is a segment that is currently being written.  Data is
appended to this segment as it comes in, until the segment is full.  Any
functions dealing with areas only need a 32bit offset, which is the
offset within the area, not the absolute device offset.

Writes within an area are also buffered.  New data first goes into the
write buffer (wbuf) and only when this is full is it flushed to the
device.  NAND flash and some NOR flashes require such buffering.  When
writing to the device, the 32bit segno and the 32bit in-segment offset
need to get converted back to a 64bit device offset.

> I've found that you store segment numbers as 32bit values (for example
> in prepare_write()), and convert requested 64bit offset into segment
> number via superblock's s_segshift.

Yes, as described above.

> This conversation seems confusing to me in case of real 64bit offsets.
> For example this one obtained via prepare_write:
> 
> 7  1 logfs_prepare_write78  fs/logfs/file.c
> 8  2 logfs_readpage_nolock20  fs/logfs/file.c
> 9  1 logfs_read_block   351  fs/logfs/readwrite.c
> 10  1 logfs_read_loop   139  fs/logfs/readwrite.c
> 11  2 logfs_segment_read   108  fs/logfs/readwrite.c
> 12  1 wbuf_read 289 
> 
> u32 segno = ofs >> super->s_segshift;
> 
> ofs is originally obtained from inode's li_data array, which is filled
> with raw segment numbers which can be 64bit (here is another issue,
> since logfs_segment_write() returns signed, so essentially logfs is
> 63bit filesystem).

The filesystem format is 64bit.  The current code can only deal with
63bit.  Eric Sandeen just fixed ext2 to actually deal with 32bit
numbers and the same is possible for logfs.  If anyone ever cares...

> But here I've came to area management in logfs, and found that it is
> 32bit only, for example __logfs_segment_write()/__logfs_get_free_bytes() 
> returns signed 32 bit value (so it is reduced to 31 bit), which is then 
> placed into li_data as 64bit value. The latter
> (__logfs_get_free_bytes()) truncates 64bit data value obtained via
> dev_ofs() into signed 32 bit value.

That indeed is a bug.  __logfs_get_free_bytes() should return s64
instead of s32.  Will fix immediatly.

If anyone can find similar bugs, the bounty is a beer or non-alcoholic
beverage of choice. :)

Jörn

-- 
To announce that there must be no criticism of the President, or that we
are to stand by the President, right or wrong, is not only unpatriotic
and servile, but is morally treasonable to the American public.
-- Theodore Roosevelt, Kansas City Star, 1918
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Review status (Re: [PATCH] LogFS take three)

2007-05-17 Thread Evgeniy Polyakov

Hi Jörn.

Is logfs 32bit fs or 674bit, since although you use 64bit values for
offsets, area management and strange converstions like described below 
from offset into segment number are performed in 32bit?
Is it enough for SSD for example to be 32bit only? Or if it is 64bit,
could you please explain logic behind area management?

I've found that you store segment numbers as 32bit values (for example
in prepare_write()), and convert requested 64bit offset into segment
number via superblock's s_segshift.
This conversation seems confusing to me in case of real 64bit offsets.
For example this one obtained via prepare_write:

7  1 logfs_prepare_write78  fs/logfs/file.c
8  2 logfs_readpage_nolock20  fs/logfs/file.c
9  1 logfs_read_block   351  fs/logfs/readwrite.c
10  1 logfs_read_loop   139  fs/logfs/readwrite.c
11  2 logfs_segment_read   108  fs/logfs/readwrite.c
12  1 wbuf_read 289 

u32 segno = ofs >> super->s_segshift;

ofs is originally obtained from inode's li_data array, which is filled
with raw segment numbers which can be 64bit (here is another issue,
since logfs_segment_write() returns signed, so essentially logfs is
63bit filesystem).

But here I've came to area management in logfs, and found that it is
32bit only, for example __logfs_segment_write()/__logfs_get_free_bytes() 
returns signed 32 bit value (so it is reduced to 31 bit), which is then 
placed into li_data as 64bit value. The latter
(__logfs_get_free_bytes()) truncates 64bit data value obtained via
dev_ofs() into signed 32 bit value.

-- 
Evgeniy Polyakov
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: Software raid0 will crash the file-system, when each disk is 5TB

2007-05-17 Thread Jan Engelhardt

On May 17 2007 21:11, Neil Brown wrote:
>On Thursday May 17, [EMAIL PROTECTED] wrote:
>> XOR it (0^0=1), and hence fills up the host disk.
>
>Uhmm... you need to check your maths.
>
>$ perl -e 'printf "%d\n", 0^0;'
>0
>
>:-)

(ouch)
You know just as I that ^ is the power operator!
I just... wrongly named it XOR :p

$ echo '0^0' | bc -l
1


Well, right, setting up a blank raid5 array inside vmware will not make
the host file significantly larger, making it easy to build megatera
arrays with gigabyte range host disks.


Jan
-- 
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] LogFS take three

2007-05-17 Thread Arnd Bergmann
On Tuesday 15 May 2007, Jörn Engel wrote:
> Add LogFS, a scalable flash filesystem.

Hi Jörn,

Sorry for not commenting earlier, there were so many discussions on version
two that I wanted to wait for the fallout of that instead of duplicating
all the comments.

Here are a few things I notice while going through the third version:

> +/*
> + * Private errno for accessed beyond end-of-file.  Only used internally to
> + * logfs.  If this ever gets exposed to userspace or even other parts of the
> + * kernel, it is a bug.  256 was chosen as a number sufficiently above all
> + * used errno #defines.
> + *
> + * It can be argued that this is a hack and should be replaced with something
> + * else.  My last attempt to do this failed spectacularly and there are more
> + * urgent problems that users actually care about.  This will remain for the
> + * moment.  Patches are wellcome, of course.
> + */
> +#define EOF  256

It should at least be in the kernel-only errno range between 512 and 4095,
that way it can eventually be added to include/linux/errno.h.

> + * Target rename works in three atomic steps:
> + * 1. Attach old inode to new dentry (remember old dentry and new inode)
> + * 2. Remove old dentry (still remember the new inode)
> + * 3. Remove new inode
> + *
> + * Here we remember both an inode an a dentry.  If we get interrupted
> + * between steps 1 and 2, we delete both the dentry and the inode.  If
> + * we get interrupted between steps 2 and 3, we delete just the inode.
> + * In either case, the remaining objects are deleted on next mount.  From
> + * a users point of view, the operation succeeded.

This description had me confused for a while: why would you remove the
new inode. Maybe change the text to say 'target inode' or 'victim inode'?

> +static int logfs_mkdir(struct inode *dir, struct dentry *dentry, int mode)
> +{
> + struct inode *inode;
> +
> + if (dir->i_nlink >= LOGFS_LINK_MAX)
> + return -EMLINK;

Why is i_nlink limited? Don't you run out of space for inodes before
overflowing?

> + * In principle, this function should loop forever, looking for GC candidates
> + * and moving data.  LogFS is designed in such a way that this loop is
> + * guaranteed to terminate.
> + *
> + * Limiting the loop to four iterations serves purely to catch cases when
> + * these guarantees have failed.  An actual endless loop is an obvious bug
> + * and should be reported as such.
> + *
> + * But there is another nasty twist to this.  As I have described in my LCA
> + * presentation, Garbage collection would have to limit itself to higher
> + * levels if the number of available free segments goes down.  This code
> + * doesn't and should fail spectacularly.  Yet - hard as I tried I haven't
> + * been able to make it fail (short of a bug elsewhere).
> + *
> + * So in a way this code is intentionally wrong as a desperate cry for a
> + * better testcase.  And I do expect to get blamed for it one day. :(
> + */

Could you bug the code to reserve fewer segments for GC than you really
need, in order to stress test GC?

> +static struct inode *logfs_alloc_inode(struct super_block *sb)
> +{
> + struct logfs_inode *li;
> +
> + li = kmem_cache_alloc(logfs_inode_cache, GFP_KERNEL);
> + if (!li)
> + return NULL;
> + logfs_init_inode(&li->vfs_inode);
> + return &li->vfs_inode;
> +}
> +
> +
> +struct inode *logfs_new_meta_inode(struct super_block *sb, u64 ino)
> +{
> + struct inode *inode;
> +
> + inode = logfs_alloc_inode(sb);
> + if (!inode)
> + return ERR_PTR(-ENOMEM);
> +
> + logfs_init_inode(inode);

logfs_alloc_inode() returns an initialized inode, so no need to call
logfs_init_inode() again, right?

> +static __be64 timespec_to_be64(struct timespec tsp)
> +{
> + u64 time = ((u64)tsp.tv_sec << 32) + (tsp.tv_nsec & 0x);
> +
> + WARN_ON(tsp.tv_nsec > 9);
> + return cpu_to_be64(time);
> +}

Why not just store 64 bit nanoseconds? that would avoid the problem
with ns overflow and the year-2038 bug. OTOH, that would require
a 64 bit integer division when reading the data, so it gets you
a runtime overhead.

> +static void logfs_read_inode(struct inode *inode)
> +{
> + int ret;
> +
> + BUG_ON(inode->i_ino == LOGFS_INO_MASTER);
> +
> + ret = __logfs_read_inode(inode);
> +
> + /* What else can we do here? */
> + BUG_ON(ret);
> +}

ext2 returns make_bad_inode(inode) in this case, which seems to be
a better solution than crashing.

> +int __logfs_write_inode(struct inode *inode)
> +{
> + /*
> +  * FIXME: Those two inodes are 512 bytes in total.  Not good to
> +  * have on the stack.  Possibly the best solution would be to bite
> +  * the bullet and do another format change before release and
> +  * shrink the inodes.
> +  */
> + struct logfs_disk_inode old, new;
> +
> + BUG_ON(inode->i_ino == LOGFS_INO_MASTER);
> +
> + /* read and compare the inode first.  If it hasn't

[PATCH 6/6][TAKE4] ext4: write support for preallocated blocks

2007-05-17 Thread Amit K. Arora
This patch adds write support to the uninitialized extents that get
created when a preallocation is done using fallocate(). It takes care of
splitting the extents into multiple (upto three) extents and merging the
new split extents with neighbouring ones, if possible.

Changelog:
-
Changes from Take3 to Take4:
 - no change -
Changes from Take2 to Take3:
 1) Patch now rebased to 2.6.22-rc1 kernel.
Changes from Take1 to Take2:
 1) Replaced BUG_ON with WARN_ON & ext4_error.
 2) Added variable names to the function declaration of
ext4_ext_try_to_merge().
 3) Updated variable declarations to use multiple-definitions-per-line.
 4) "if((a=foo())).." was broken into "a=foo(); if(a).."
 5) Removed extra spaces.

Here is the updated patch:

Signed-off-by: Amit Arora <[EMAIL PROTECTED]>
---
 fs/ext4/extents.c   |  234 +++-
 include/linux/ext4_fs_extents.h |3 
 2 files changed, 210 insertions(+), 27 deletions(-)

Index: linux-2.6.22-rc1/fs/ext4/extents.c
===
--- linux-2.6.22-rc1.orig/fs/ext4/extents.c
+++ linux-2.6.22-rc1/fs/ext4/extents.c
@@ -1140,6 +1140,54 @@ ext4_can_extents_be_merged(struct inode 
 }
 
 /*
+ * This function tries to merge the "ex" extent to the next extent in the tree.
+ * It always tries to merge towards right. If you want to merge towards
+ * left, pass "ex - 1" as argument instead of "ex".
+ * Returns 0 if the extents (ex and ex+1) were _not_ merged and returns
+ * 1 if they got merged.
+ */
+int ext4_ext_try_to_merge(struct inode *inode,
+ struct ext4_ext_path *path,
+ struct ext4_extent *ex)
+{
+   struct ext4_extent_header *eh;
+   unsigned int depth, len;
+   int merge_done = 0;
+   int uninitialized = 0;
+
+   depth = ext_depth(inode);
+   BUG_ON(path[depth].p_hdr == NULL);
+   eh = path[depth].p_hdr;
+
+   while (ex < EXT_LAST_EXTENT(eh))
+   {
+   if (!ext4_can_extents_be_merged(inode, ex, ex + 1))
+   break;
+   /* merge with next extent! */
+   if (ext4_ext_is_uninitialized(ex))
+   uninitialized = 1;
+   ex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex)
+   + ext4_ext_get_actual_len(ex + 1));
+   if (uninitialized)
+   ext4_ext_mark_uninitialized(ex);
+
+   if (ex + 1 < EXT_LAST_EXTENT(eh)) {
+   len = (EXT_LAST_EXTENT(eh) - ex - 1)
+   * sizeof(struct ext4_extent);
+   memmove(ex + 1, ex + 2, len);
+   }
+   eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries) - 1);
+   merge_done = 1;
+   WARN_ON(eh->eh_entries == 0);
+   if (!eh->eh_entries)
+   ext4_error(inode->i_sb, "ext4_ext_try_to_merge",
+  "inode#%lu, eh->eh_entries = 0!", inode->i_ino);
+   }
+
+   return merge_done;
+}
+
+/*
  * check if a portion of the "newext" extent overlaps with an
  * existing extent.
  *
@@ -1327,25 +1375,7 @@ has_space:
 
 merge:
/* try to merge extents to the right */
-   while (nearex < EXT_LAST_EXTENT(eh)) {
-   if (!ext4_can_extents_be_merged(inode, nearex, nearex + 1))
-   break;
-   /* merge with next extent! */
-   if (ext4_ext_is_uninitialized(nearex))
-   uninitialized = 1;
-   nearex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(nearex)
-   + ext4_ext_get_actual_len(nearex + 1));
-   if (uninitialized)
-   ext4_ext_mark_uninitialized(nearex);
-
-   if (nearex + 1 < EXT_LAST_EXTENT(eh)) {
-   len = (EXT_LAST_EXTENT(eh) - nearex - 1)
-   * sizeof(struct ext4_extent);
-   memmove(nearex + 1, nearex + 2, len);
-   }
-   eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries)-1);
-   BUG_ON(eh->eh_entries == 0);
-   }
+   ext4_ext_try_to_merge(inode, path, nearex);
 
/* try to merge extents to the left */
 
@@ -2011,15 +2041,152 @@ void ext4_ext_release(struct super_block
 #endif
 }
 
+/*
+ * This function is called by ext4_ext_get_blocks() if someone tries to write
+ * to an uninitialized extent. It may result in splitting the uninitialized
+ * extent into multiple extents (upto three - one initialized and two
+ * uninitialized).
+ * There are three possibilities:
+ *   a> There is no split required: Entire extent should be initialized
+ *   b> Splits in two extents: Write is happening at either end of the extent
+ *   c> Splits in three extents: Somone is writing in middle of the extent
+ */
+int ext4_ext_convert_to_initialized(handle_t *h

[PATCH 5/6][TAKE4] ext4: fallocate support in ext4

2007-05-17 Thread Amit K. Arora
This patch implements ->fallocate() inode operation in ext4. With this
patch users of ext4 file systems will be able to use fallocate() system
call for persistent preallocation.

Current implementation only supports preallocation for regular files
(directories not supported as of date) with extent maps. This patch
does not support block-mapped files currently.

Only FA_ALLOCATE mode is being supported as of now. Supporting
FA_DEALLOCATE mode is a  item.

Changelog:
-
Changes from Take3 to Take4:
 1) Changed ext4_fllocate() declaration and definition to return a "long"
and not an "int", to match with ->fallocate() inode op.
 2) Update ctime if new blocks get allocated.
Changes from Take2 to Take3:
 1) Patch rebased to 2.6.22-rc1 kernel version.
 2) Removed unnecessary "EXPORT_SYMBOL(ext4_fallocate);".
Changes from Take1 to Take2:
 1) Added more description for ext4_fallocate().
 2) Now returning EOPNOTSUPP when files are block-mapped (non-extent).
 3) Moved journal_start & journal_stop inside the while loop.
 4) Replaced BUG_ON with WARN_ON & ext4_error.
 5) Make EXT4_BLOCK_ALIGN use ALIGN macro internally.
 6) Added variable names in the function declaration of ext4_fallocate()
 7) Converted macros that handle uninitialized extents into inline
functions.

Here is the updated patch:

Signed-off-by: Amit Arora <[EMAIL PROTECTED]>
---
 fs/ext4/extents.c   |  249 +---
 fs/ext4/file.c  |1 
 include/linux/ext4_fs.h |8 +
 include/linux/ext4_fs_extents.h |   12 +
 4 files changed, 229 insertions(+), 41 deletions(-)

Index: linux-2.6.22-rc1/fs/ext4/extents.c
===
--- linux-2.6.22-rc1.orig/fs/ext4/extents.c
+++ linux-2.6.22-rc1/fs/ext4/extents.c
@@ -282,7 +282,7 @@ static void ext4_ext_show_path(struct in
} else if (path->p_ext) {
ext_debug("  %d:%d:%llu ",
  le32_to_cpu(path->p_ext->ee_block),
- le16_to_cpu(path->p_ext->ee_len),
+ ext4_ext_get_actual_len(path->p_ext),
  ext_pblock(path->p_ext));
} else
ext_debug("  []");
@@ -305,7 +305,7 @@ static void ext4_ext_show_leaf(struct in
 
for (i = 0; i < le16_to_cpu(eh->eh_entries); i++, ex++) {
ext_debug("%d:%d:%llu ", le32_to_cpu(ex->ee_block),
- le16_to_cpu(ex->ee_len), ext_pblock(ex));
+ ext4_ext_get_actual_len(ex), ext_pblock(ex));
}
ext_debug("\n");
 }
@@ -425,7 +425,7 @@ ext4_ext_binsearch(struct inode *inode, 
ext_debug("  -> %d:%llu:%d ",
le32_to_cpu(path->p_ext->ee_block),
ext_pblock(path->p_ext),
-   le16_to_cpu(path->p_ext->ee_len));
+   ext4_ext_get_actual_len(path->p_ext));
 
 #ifdef CHECK_BINSEARCH
{
@@ -686,7 +686,7 @@ static int ext4_ext_split(handle_t *hand
ext_debug("move %d:%llu:%d in new leaf %llu\n",
le32_to_cpu(path[depth].p_ext->ee_block),
ext_pblock(path[depth].p_ext),
-   le16_to_cpu(path[depth].p_ext->ee_len),
+   ext4_ext_get_actual_len(path[depth].p_ext),
newblock);
/*memmove(ex++, path[depth].p_ext++,
sizeof(struct ext4_extent));
@@ -1106,7 +1106,19 @@ static int
 ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1,
struct ext4_extent *ex2)
 {
-   if (le32_to_cpu(ex1->ee_block) + le16_to_cpu(ex1->ee_len) !=
+   unsigned short ext1_ee_len, ext2_ee_len;
+
+   /*
+* Make sure that either both extents are uninitialized, or
+* both are _not_.
+*/
+   if (ext4_ext_is_uninitialized(ex1) ^ ext4_ext_is_uninitialized(ex2))
+   return 0;
+
+   ext1_ee_len = ext4_ext_get_actual_len(ex1);
+   ext2_ee_len = ext4_ext_get_actual_len(ex2);
+
+   if (le32_to_cpu(ex1->ee_block) + ext1_ee_len !=
le32_to_cpu(ex2->ee_block))
return 0;
 
@@ -1115,14 +1127,14 @@ ext4_can_extents_be_merged(struct inode 
 * as an RO_COMPAT feature, refuse to merge to extents if
 * this can result in the top bit of ee_len being set.
 */
-   if (le16_to_cpu(ex1->ee_len) + le16_to_cpu(ex2->ee_len) > EXT_MAX_LEN)
+   if (ext1_ee_len + ext2_ee_len > EXT_MAX_LEN)
return 0;
 #ifdef AGGRESSIVE_TEST
if (le16_to_cpu(ex1->ee_len) >= 4)
return 0;
 #endif
 
-   if (ext_pblock(ex1) + le16_to_cpu(ex1->ee_len) == ext_pblock(ex2))
+   if (ext_pblock(ex1) + ext1_ee_len == ext_pblock(ex2))
retur

[PATCH 4/6][TAKE4] ext4: Extent overlap bugfix

2007-05-17 Thread Amit K. Arora
This patch adds a check for overlap of extents and cuts short the
new extent to be inserted, if there is a chance of overlap.

Changelog:
-
Changes from Take3 to Take4:
 - no change -
Changes from Take2 to Take3:
 1) Patch rebased to 2.6.22-rc1 kernel.
Changes from Take1 to Take2:
 1) As suggested by Andrew, a check for wrap though zero has been added.

Here is the new patch:

Signed-off-by: Amit Arora <[EMAIL PROTECTED]>
---
 fs/ext4/extents.c   |   60 ++--
 include/linux/ext4_fs_extents.h |1 
 2 files changed, 59 insertions(+), 2 deletions(-)

Index: linux-2.6.22-rc1/fs/ext4/extents.c
===
--- linux-2.6.22-rc1.orig/fs/ext4/extents.c
+++ linux-2.6.22-rc1/fs/ext4/extents.c
@@ -1128,6 +1128,55 @@ ext4_can_extents_be_merged(struct inode 
 }
 
 /*
+ * check if a portion of the "newext" extent overlaps with an
+ * existing extent.
+ *
+ * If there is an overlap discovered, it updates the length of the newext
+ * such that there will be no overlap, and then returns 1.
+ * If there is no overlap found, it returns 0.
+ */
+unsigned int ext4_ext_check_overlap(struct inode *inode,
+   struct ext4_extent *newext,
+   struct ext4_ext_path *path)
+{
+   unsigned long b1, b2;
+   unsigned int depth, len1;
+   unsigned int ret = 0;
+
+   b1 = le32_to_cpu(newext->ee_block);
+   len1 = le16_to_cpu(newext->ee_len);
+   depth = ext_depth(inode);
+   if (!path[depth].p_ext)
+   goto out;
+   b2 = le32_to_cpu(path[depth].p_ext->ee_block);
+
+   /*
+* get the next allocated block if the extent in the path
+* is before the requested block(s) 
+*/
+   if (b2 < b1) {
+   b2 = ext4_ext_next_allocated_block(path);
+   if (b2 == EXT_MAX_BLOCK)
+   goto out;
+   }
+
+   /* check for wrap through zero */
+   if (b1 + len1 < b1) {
+   len1 = EXT_MAX_BLOCK - b1;
+   newext->ee_len = cpu_to_le16(len1);
+   ret = 1;
+   }
+
+   /* check for overlap */
+   if (b1 + len1 > b2) {
+   newext->ee_len = cpu_to_le16(b2 - b1);
+   ret = 1;
+   }
+out:
+   return ret;
+}
+
+/*
  * ext4_ext_insert_extent:
  * tries to merge requsted extent into the existing extent or
  * inserts requested extent as new one into the tree,
@@ -2031,7 +2080,15 @@ int ext4_ext_get_blocks(handle_t *handle
 
/* allocate new block */
goal = ext4_ext_find_goal(inode, path, iblock);
-   allocated = max_blocks;
+
+   /* Check if we can really insert (iblock)::(iblock+max_blocks) extent */
+   newex.ee_block = cpu_to_le32(iblock);
+   newex.ee_len = cpu_to_le16(max_blocks);
+   err = ext4_ext_check_overlap(inode, &newex, path);
+   if (err)
+   allocated = le16_to_cpu(newex.ee_len);
+   else
+   allocated = max_blocks;
newblock = ext4_new_blocks(handle, inode, goal, &allocated, &err);
if (!newblock)
goto out2;
@@ -2039,7 +2096,6 @@ int ext4_ext_get_blocks(handle_t *handle
goal, newblock, allocated);
 
/* try to insert new extent into found leaf and return */
-   newex.ee_block = cpu_to_le32(iblock);
ext4_ext_store_pblock(&newex, newblock);
newex.ee_len = cpu_to_le16(allocated);
err = ext4_ext_insert_extent(handle, inode, path, &newex);
Index: linux-2.6.22-rc1/include/linux/ext4_fs_extents.h
===
--- linux-2.6.22-rc1.orig/include/linux/ext4_fs_extents.h
+++ linux-2.6.22-rc1/include/linux/ext4_fs_extents.h
@@ -190,6 +190,7 @@ ext4_ext_invalidate_cache(struct inode *
 
 extern int ext4_extent_tree_init(handle_t *, struct inode *);
 extern int ext4_ext_calc_credits_for_insert(struct inode *, struct 
ext4_ext_path *);
+extern unsigned int ext4_ext_check_overlap(struct inode *, struct ext4_extent 
*, struct ext4_ext_path *);
 extern int ext4_ext_insert_extent(handle_t *, struct inode *, struct 
ext4_ext_path *, struct ext4_extent *);
 extern int ext4_ext_walk_space(struct inode *, unsigned long, unsigned long, 
ext_prepare_callback, void *);
 extern struct ext4_ext_path * ext4_ext_find_extent(struct inode *, int, struct 
ext4_ext_path *);
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/6][TAKE4] fallocate() on ia64

2007-05-17 Thread Amit K. Arora
Here is the 2.6.22-rc1 version of David's patch: add fallocate() on ia64

From: David Chinner <[EMAIL PROTECTED]>
Subject: [PATCH] ia64 fallocate syscall
Cc: "Amit K. Arora" <[EMAIL PROTECTED]>, 
[EMAIL PROTECTED], [EMAIL PROTECTED],
[EMAIL PROTECTED], [EMAIL PROTECTED]

ia64 fallocate syscall support.

Signed-Off-By: Dave Chinner <[EMAIL PROTECTED]>

---
 arch/ia64/kernel/entry.S  |1 +
 include/asm-ia64/unistd.h |3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)

Index: linux-2.6.22-rc1/arch/ia64/kernel/entry.S
===
--- linux-2.6.22-rc1.orig/arch/ia64/kernel/entry.S  2007-05-12 
18:45:56.0 -0700
+++ linux-2.6.22-rc1/arch/ia64/kernel/entry.S   2007-05-15 15:36:48.0 
-0700
@@ -1585,5 +1585,6 @@
data8 sys_getcpu
data8 sys_epoll_pwait   // 1305
data8 sys_utimensat
+   data8 sys_fallocate

.org sys_call_table + 8*NR_syscalls // guard against failures to 
increase NR_syscalls
Index: linux-2.6.22-rc1/include/asm-ia64/unistd.h
===
--- linux-2.6.22-rc1.orig/include/asm-ia64/unistd.h 2007-05-12 
18:45:56.0 -0700
+++ linux-2.6.22-rc1/include/asm-ia64/unistd.h  2007-05-15 15:37:51.0 
-0700
@@ -296,6 +296,7 @@
 #define __NR_getcpu1304
 #define __NR_epoll_pwait   1305
 #define __NR_utimensat 1306
+#define __NR_fallocate 1307

 #ifdef __KERNEL__


-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/6][TAKE4] fallocate() on s390

2007-05-17 Thread Amit K. Arora
This is the patch suggested by Martin Schwidefsky to support
sys_fallocate() on s390(x) platform.

He also suggested a wrapper in glibc to handle this system call on
s390. Posting it here so that we get feedback for this too.

.globl __fallocate
ENTRY(__fallocate)
stm %r6,%r7,28(%r15)/* save %r6/%r7 on stack */
cfi_offset (%r7, -68)
cfi_offset (%r6, -72)
lm  %r6,%r7,96(%r15)/* load loff_t len from stack */
svc SYS_ify(fallocate)
lm  %r6,%r7,28(%r15)/* restore %r6/%r7 from stack */
br  %r14
PSEUDO_END(__fallocate)


Here are the comments and the patch to linux kernel from him.

-
From: Martin Schwidefsky <[EMAIL PROTECTED]>

This patch implements support of fallocate system call on s390(x)
platform. A wrapper is added to address the issue which s390 ABI has
with the arguments of this system call.

Signed-off-by: Martin Schwidefsky <[EMAIL PROTECTED]>
---
 arch/s390/kernel/compat_wrapper.S |   10 ++
 arch/s390/kernel/sys_s390.c   |   29 +
 arch/s390/kernel/syscalls.S   |1 +
 include/asm-s390/unistd.h |3 ++-
 4 files changed, 42 insertions(+), 1 deletion(-)

Index: linux-2.6.22-rc1/arch/s390/kernel/compat_wrapper.S
===
--- linux-2.6.22-rc1.orig/arch/s390/kernel/compat_wrapper.S
+++ linux-2.6.22-rc1/arch/s390/kernel/compat_wrapper.S
@@ -1682,3 +1682,13 @@ compat_sys_utimes_wrapper:
llgtr   %r2,%r2 # char *
llgtr   %r3,%r3 # struct compat_timeval *
jg  compat_sys_utimes
+
+   .globl  sys_fallocate_wrapper
+sys_fallocate_wrapper:
+   lgfr%r2,%r2 # int
+   lgfr%r3,%r3 # int
+   sllg%r4,%r4,32  # get high word of 64bit loff_t
+   lr  %r4,%r5 # get low word of 64bit loff_t
+   sllg%r5,%r6,32  # get high word of 64bit loff_t
+   l   %r5,164(%r15)   # get low word of 64bit loff_t
+   jg  sys_fallocate
Index: linux-2.6.22-rc1/arch/s390/kernel/sys_s390.c
===
--- linux-2.6.22-rc1.orig/arch/s390/kernel/sys_s390.c
+++ linux-2.6.22-rc1/arch/s390/kernel/sys_s390.c
@@ -265,3 +265,32 @@ s390_fadvise64_64(struct fadvise64_64_ar
return -EFAULT;
return sys_fadvise64_64(a.fd, a.offset, a.len, a.advice);
 }
+
+#ifndef CONFIG_64BIT
+/*
+ * This is a wrapper to call sys_fallocate(). For 31 bit s390 the last
+ * 64 bit argument "len" is split into the upper and lower 32 bits. The
+ * system call wrapper in the user space loads the value to %r6/%r7.
+ * The code in entry.S keeps the values in %r2 - %r6 where they are and
+ * stores %r7 to 96(%r15). But the standard C linkage requires that
+ * the whole 64 bit value for len is stored on the stack and doesn't
+ * use %r6 at all. So s390_fallocate has to convert the arguments from
+ *   %r2: fd, %r3: mode, %r4/%r5: offset, %r6/96(%r15)-99(%r15): len
+ * to
+ *   %r2: fd, %r3: mode, %r4/%r5: offset, 96(%r15)-103(%r15): len
+ */
+asmlinkage long s390_fallocate(int fd, int mode, loff_t offset,
+  u32 len_high, u32 len_low)
+{
+   union {
+   u64 len;
+   struct {
+   u32 high;
+   u32 low;
+   };
+   } cv;
+   cv.high = len_high;
+   cv.low = len_low;
+   return sys_fallocate(fd, mode, offset, cv.len);
+}
+#endif
Index: linux-2.6.22-rc1/arch/s390/kernel/syscalls.S
===
--- linux-2.6.22-rc1.orig/arch/s390/kernel/syscalls.S
+++ linux-2.6.22-rc1/arch/s390/kernel/syscalls.S
@@ -322,3 +322,4 @@ NI_SYSCALL  
/* 310 sys_move_pages *
 SYSCALL(sys_getcpu,sys_getcpu,sys_getcpu_wrapper)
 SYSCALL(sys_epoll_pwait,sys_epoll_pwait,compat_sys_epoll_pwait_wrapper)
 SYSCALL(sys_utimes,sys_utimes,compat_sys_utimes_wrapper)
+SYSCALL(s390_fallocate,sys_fallocate,sys_fallocate_wrapper)
Index: linux-2.6.22-rc1/include/asm-s390/unistd.h
===
--- linux-2.6.22-rc1.orig/include/asm-s390/unistd.h
+++ linux-2.6.22-rc1/include/asm-s390/unistd.h
@@ -251,8 +251,9 @@
 #define __NR_getcpu311
 #define __NR_epoll_pwait   312
 #define __NR_utimes313
+#define __NR_fallocate 314
 
-#define NR_syscalls 314
+#define NR_syscalls 315
 
 /* 
  * There are some system calls that are not present on 64 bit, some
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/6][TAKE4] fallocate() implementation on i86, x86_64 and powerpc

2007-05-17 Thread Amit K. Arora
This patch implements sys_fallocate() and adds support on i386, x86_64
and powerpc platforms.

Changelog:
-
Changes from Take3 to Take4:
 1) Do not update c/mtime. Let each filesystem update ctime (update of
mtime will not be required for allocation since we touch only
metadata/inode and not blocks), if required.
Changes from Take2 to Take3:
 1) Patches now based on 2.6.22-rc1 kernel.
Changes from Take1(initial post on 26th April, 2007) to Take2:
 1) Added description before sys_fallocate() definition.
 2) Return EINVAL for len<=0 (With new draft that Ulrich pointed to,
posix_fallocate should return EINVAL for len <= 0.
 3) Return EOPNOTSUPP if mode is not one of FA_ALLOCATE or FA_DEALLOCATE
 4) Do not return ENODEV for dirs (let individual file systems decide if
they want to support preallocation to directories or not.
 5) Check for wrap through zero.
 6) Update c/mtime if fallocate() succeeds.
 7) Added mode descriptions in fs.h
 8) Added variable names to function definition (fallocate inode op)

Here is the new patch:

Signed-off-by: Amit Arora <[EMAIL PROTECTED]>
---
 arch/i386/kernel/syscall_table.S |1 
 arch/powerpc/kernel/sys_ppc32.c  |7 +++
 arch/x86_64/ia32/ia32entry.S |1 
 fs/open.c|   86 +++
 include/asm-i386/unistd.h|3 -
 include/asm-powerpc/systbl.h |1 
 include/asm-powerpc/unistd.h |3 -
 include/asm-x86_64/unistd.h  |2 
 include/linux/fs.h   |   13 +
 include/linux/syscalls.h |1 
 10 files changed, 116 insertions(+), 2 deletions(-)

Index: linux-2.6.22-rc1/arch/i386/kernel/syscall_table.S
===
--- linux-2.6.22-rc1.orig/arch/i386/kernel/syscall_table.S
+++ linux-2.6.22-rc1/arch/i386/kernel/syscall_table.S
@@ -323,3 +323,4 @@ ENTRY(sys_call_table)
.long sys_signalfd
.long sys_timerfd
.long sys_eventfd
+   .long sys_fallocate
Index: linux-2.6.22-rc1/arch/powerpc/kernel/sys_ppc32.c
===
--- linux-2.6.22-rc1.orig/arch/powerpc/kernel/sys_ppc32.c
+++ linux-2.6.22-rc1/arch/powerpc/kernel/sys_ppc32.c
@@ -773,6 +773,13 @@ asmlinkage int compat_sys_truncate64(con
return sys_truncate(path, (high << 32) | low);
 }
 
+asmlinkage long compat_sys_fallocate(int fd, int mode, u32 offhi, u32 offlo,
+u32 lenhi, u32 lenlo)
+{
+   return sys_fallocate(fd, mode, ((loff_t)offhi << 32) | offlo,
+((loff_t)lenhi << 32) | lenlo);
+}
+
 asmlinkage int compat_sys_ftruncate64(unsigned int fd, u32 reg4, unsigned long 
high,
 unsigned long low)
 {
Index: linux-2.6.22-rc1/fs/open.c
===
--- linux-2.6.22-rc1.orig/fs/open.c
+++ linux-2.6.22-rc1/fs/open.c
@@ -353,6 +353,92 @@ asmlinkage long sys_ftruncate64(unsigned
 #endif
 
 /*
+ * sys_fallocate - preallocate blocks or free preallocated blocks
+ * @fd: the file descriptor
+ * @mode: mode specifies if fallocate should preallocate blocks OR free
+ *   (unallocate) preallocated blocks. Currently only FA_ALLOCATE and
+ *   FA_DEALLOCATE modes are supported.
+ * @offset: The offset within file, from where (un)allocation is being
+ * requested. It should not have a negative value.
+ * @len: The amount (in bytes) of space to be (un)allocated, from the offset.
+ *
+ * This system call, depending on the mode, preallocates or unallocates blocks
+ * for a file. The range of blocks depends on the value of offset and len
+ * arguments provided by the user/application. For FA_ALLOCATE mode, if this
+ * system call succeeds, subsequent writes to the file in the given range
+ * (specified by offset & len) should not fail - even if the file system
+ * later becomes full. Hence the preallocation done is persistent (valid
+ * even after reopen of the file and remount/reboot).
+ *
+ * It is expected that the ->fallocate() inode operation implemented by the
+ * individual file systems will update the file size and/or ctime/mtime
+ * depending on the mode and also on the success of the operation.
+ *
+ * Note: Incase the file system does not support preallocation,
+ * posix_fallocate() should fall back to the library implementation (i.e.
+ * allocating zero-filled new blocks to the file).
+ *
+ * Return Values
+ * 0   : On SUCCESS a value of zero is returned.
+ * error   : On Failure, an error code will be returned.
+ * An error code of -ENOSYS or -EOPNOTSUPP should make posix_fallocate()
+ * fall back on library implementation of fallocate.
+ *
+ *  Generic fallocate to be added for file systems that do not
+ *  support fallocate it.
+ */
+asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)
+{
+   struct file *file;
+   struct inode *inode;
+   long r

[PATCH 0/6][TAKE4] fallocate system call

2007-05-17 Thread Amit K. Arora
Description:
---
fallocate() is a new system call being proposed here which will allow
applications to preallocate space to any file(s) in a file system.
Each file system implementation that wants to use this feature will need
to support an inode operation called fallocate.

Applications can use this feature to avoid fragmentation to certain
level and thus get faster access speed. With preallocation, applications
also get a guarantee of space for particular file(s) - even if later the
the system becomes full.

Currently, glibc provides an interface called posix_fallocate() which
can be used for similar cause. Though this has the advantage of working
on all file systems, but it is quite slow (since it writes zeroes to
each block that has to be preallocated). Without a doubt, file systems
can do this more efficiently within the kernel, by implementing
the proposed fallocate() system call. It is expected that
posix_fallocate() will be modified to call this new system call first
and incase the kernel/filesystem does not implement it, it should fall
back to the current implementation of writing zeroes to the new blocks.

Interface:
-
The proposed system call's layout is:

 asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)

fd: The descriptor of the open file.

mode*: This specifies the behavior of the system call. Currently the
  system call supports two modes - FA_ALLOCATE and FA_DEALLOCATE.
  FA_ALLOCATE: Applications can use this mode to preallocate blocks to
a given file (specified by fd). This mode changes the file size if
the preallocation is done beyond the EOF. It also updates the
ctime in the inode of the corresponding file, marking a
successfull allocation.
  FA_DEALLOCATE: This mode can be used by applications to deallocate the
previously preallocated blocks. This also may change the file size
and the ctime/mtime.
* New modes might get added in future. One such new mode which is
  already under discussion is FA_PREALLOCATE, which when used will
  preallocate space but will not change the filesize and [cm]time.
  Since the semantics of this new mode is not clear and agreed upon yet,
  this patchset does not implement it currently.

offset: This is the offset in bytes, from where the preallocation should
  start.

len: This is the number of bytes requested for preallocation (from
  offset).
 
RETURN VALUE: The system call returns 0 on success and an error on
failure. This is done to keep the semantics same as of
posix_fallocate(). 

sys_fallocate() on s390:
---
There is a problem with s390 ABI to implement sys_fallocate() with the
proposed order of arguments. Martin Schwidefsky has suggested a patch to
solve this problem which makes use of a wrapper in the kernel. This will
require special handling of this system call on s390 in glibc as well.
But, this seems to be the best solution so far.

Known Problem:
-
mmapped writes into uninitialized extents is a known problem with the
current ext4 patches. Like XFS, ext4 may need to implement
->page_mkwrite() to solve this. See:
http://lkml.org/lkml/2007/5/8/583

Since there is a talk of ->fault() replacing ->page_mkwrite() and also
with a generic block_page_mkwrite() implementation already posted, we
can implement this later some time. See:
http://lkml.org/lkml/2007/3/7/161
http://lkml.org/lkml/2007/3/18/198

ToDos:
-
1> Implementation on other architectures (other than i386, x86_64,
ppc64 and s390(x)). David Chinner has already posted a patch for ia64.
2> A generic file system operation to handle fallocate
(generic_fallocate), for filesystems that do _not_ have the fallocate
inode operation implemented.
3> Changes to glibc,
   a) to support fallocate() system call
   b) to make posix_fallocate() and posix_fallocate64() call fallocate()


Changelog:
-
Changes from Take2 to Take3:
1) Return type is now described in the interface description
   above.
2) Patches rebased to 2.6.22-rc1 kernel.

** Each post will have an individual changelog for a particular patch.


Following patches follow:
Patch 1/6 : fallocate() implementation on i86, x86_64 and powerpc
Patch 2/6 : fallocate() on s390
Patch 3/6 : fallocate() on ia64
Patch 4/6 : ext4: Extent overlap bugfix
Patch 5/6 : ext4: fallocate support in ext4
Patch 6/6 : ext4: write support for preallocated blocks

--
Regards,
Amit Arora
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] file capabilities: Introduction

2007-05-17 Thread Serge E. Hallyn
Quoting Suparna Bhattacharya ([EMAIL PROTECTED]):
> On Mon, May 14, 2007 at 08:00:11PM +, Pavel Machek wrote:
> > Hi!
> > 
> > > "Serge E. Hallyn" <[EMAIL PROTECTED]> wrote:
> > > 
> > > > Following are two patches which have been sitting for some time in -mm.
> > > 
> > > Where "some time" == "nearly six months".
> > > 
> > > We need help considering, reviewing and testing this code, please.
> > 
> > I did quick scan, and it looks ok. Plus, it means we can finally start
> > using that old capabilities subsystem... so I think we should do it.
> 
> FWIW, I looked through it recently as well, and it looked reasonable enough
> to me, though I'm not a security expert. I did have a question about
> testing corner cases etc, which Serge has tried to address.
> 
> Serge, are you planning to post an update without STRICTXATTR ? That should
> simplify the second patch.

Sorry, I did but I guess I didn't cc: you on that reply.

It is at http://lkml.org/lkml/2007/5/14/276

thanks,
-serge
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] AFS: Implement shared-writable mmap [try #2]

2007-05-17 Thread David Howells
Nick Piggin <[EMAIL PROTECTED]> wrote:

> No, you shouldn't. We could theoretically introduce a new API for this,
> but I think it would be preferable if you can fix the race in the fs.

Actually, I might be able to do better.

When making a StoreData call to the AFS server, I send all the parameters
first, and at that point, the server will abort it, I think, if permission is
not available, and won't wait for the payload to be delivered.

So if I tell AF_RXRPC to send the parameter data with an explicit ACK request
and then wait till it's either hard-ACK'd or aborted, I should then be able to
deal with the permissions failure at a state where I have locked *all* the
pages to be sent.

At that point, I should be able to tell truncate to simple discard all these
locked pages.

How's that sound?

David
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/5][TAKE3] fallocate() implementation on i86, x86_64 and powerpc

2007-05-17 Thread Amit K. Arora
On Thu, May 17, 2007 at 09:40:36AM +1000, David Chinner wrote:
> On Wed, May 16, 2007 at 07:21:16AM -0500, Dave Kleikamp wrote:
> > On Wed, 2007-05-16 at 13:16 +1000, David Chinner wrote:
> > > On Wed, May 16, 2007 at 01:33:59AM +0530, Amit K. Arora wrote:
> > 
> > > > Following changes were made to the previous version:
> > > >  1) Added description before sys_fallocate() definition.
> > > >  2) Return EINVAL for len<=0 (With new draft that Ulrich pointed to,
> > > > posix_fallocate should return EINVAL for len <= 0.
> > > >  3) Return EOPNOTSUPP if mode is not one of FA_ALLOCATE or FA_DEALLOCATE
> > > >  4) Do not return ENODEV for dirs (let individual file systems decide if
> > > > they want to support preallocation to directories or not.
> > > >  5) Check for wrap through zero.
> > > >  6) Update c/mtime if fallocate() succeeds.
> > > 
> > > Please don't make this always happen. c/mtime updates should be dependent
> > > on the mode being used and whether there is visible change to the file. 
> > > If no
> > > userspace visible changes to the file occurred, then timestamps should not
> > > be changed.
> > 
> > i_blocks will be updated, so it seems reasonable to update ctime.  mtime
> > shouldn't be changed, though, since the contents of the file will be
> > unchanged.
> 
> That's assuming blocks were actually allocated - if the prealloc range already
> has underlying blocks there is no change and so we should not be changing
> mtime either. Only the filesystem will know if it has changed the file, so I
> think that timestamp updates need to be driven down to that level, not done
> blindy at the highest layer

Ok. Will make this change in the next post.

--
Regards,
Amit Arora
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> Principal Engineer
> SGI Australian Software Group
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] LogFS take three

2007-05-17 Thread Jörn Engel
On Thu, 17 May 2007 16:43:59 +0800, David Woodhouse wrote:
> 
> > As I mentioned, some techniques like log-structured filesystem could
> > perform generally better on any kind of flash-based storage with FTL.
> > Although there are many kinds of FTL, it is commonly true that
> > it performs well under workload where sequential write is dominant.
> 
> Yes, it's certainly possible that we _could_ write a file system which
> is specifically targeted at FTL -- I was just wondering why anyone would
> _bother_ :)

Haven't you done that already?  JFFS2 write behaviour is the best-case
scenario for any FTL.  When the delta cache is finished, LogFS will be
pretty close to that as well.

Not sure if anyone would specifically target FTL.  Being well-suited for
those beasts is just a side-effect.

The FTL is still a net loss.  Without that FAT enabling layer a real
flash filesystem would be more efficient.

Jörn

-- 
Prosperity makes friends, adversity tries them.
-- Publilius Syrus
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/5][TAKE3] fallocate() implementation on i86, x86_64 and powerpc

2007-05-17 Thread Dave Kleikamp
On Thu, 2007-05-17 at 09:40 +1000, David Chinner wrote:
> On Wed, May 16, 2007 at 07:21:16AM -0500, Dave Kleikamp wrote:
> > On Wed, 2007-05-16 at 13:16 +1000, David Chinner wrote:

> > > Please don't make this always happen. c/mtime updates should be dependent
> > > on the mode being used and whether there is visible change to the file. 
> > > If no
> > > userspace visible changes to the file occurred, then timestamps should not
> > > be changed.
> > 
> > i_blocks will be updated, so it seems reasonable to update ctime.  mtime
> > shouldn't be changed, though, since the contents of the file will be
> > unchanged.
> 
> That's assuming blocks were actually allocated - if the prealloc range already
> has underlying blocks there is no change and so we should not be changing
> mtime either. Only the filesystem will know if it has changed the file, so I
> think that timestamp updates need to be driven down to that level, not done
> blindy at the highest layer

Yes, I agree.

Shaggy
-- 
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] LogFS take three

2007-05-17 Thread Arnd Bergmann
On Tuesday 15 May 2007, Jörn Engel wrote:
> 
> > I've been semi watching this, and the only comment I really can give
> > is that I hate the name.  To me, logfs implies a filesystem for
> > logging purposes, not for Flash hardware with wear leveling issues to
> > be taken into account.
> 
> Yeah, well, ...
> 
> Two years ago when I started all this, I was looking for a good name.
> All I could come up with sounded stupid, so I picked "LogFS" as a code
> name.  As soon as I find a better name, the code name should get
> replaced.
> 

When doing a google search on logfs, there are less than five results
among the first 100 that don't refer to your work. The other two listed
in there are also log-structured file systems: The inferno flash file
system (http://inferno-os.googlecode.com/svn/trunk/liblogfs/) and the
(discontinued) file system named lfs from the 2005 google summer of
code.

I'd say the name should stay, changing it now can only add more confusion.

Arnd <><
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: Software raid0 will crash the file-system, when each disk is 5TB

2007-05-17 Thread Neil Brown
On Thursday May 17, [EMAIL PROTECTED] wrote:
> XOR it (0^0=1), and hence fills up the host disk.

Uhmm... you need to check your maths.

$ perl -e 'printf "%d\n", 0^0;'
0

:-)
NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] LogFS take three

2007-05-17 Thread David Woodhouse
On Thu, 2007-05-17 at 09:12 +, Pavel Machek wrote:
> Nah, it would lead to Jorn disappearing misteriously and _Pavel_
> accused of murder ;-).

Are you suggesting that you would murder Jörn (you misspelled his name)
merely for the heinous crime of using his own name?

Your Luddism was already quite excessive, but now you really _are_
taking it to extremes. :)

-- 
dwmw2

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] LogFS take three

2007-05-17 Thread Pavel Machek

> >> My experience is that no matter which name I pick, 
> >people will
> >> complain
> >> anyway.  Previous suggestions included:
> >> jffs3
> >> jefs
> >> engelfs
> >> poofs
> >> crapfs
> >> sweetfs
> >> cutefs
> >> dynamic journaling fs - djofs
> >> tfsfkal - the file system formerly known as logfs
> >
> >Can we call it jörnfs? :)
> 
> However if Jörn is accused of murder, it will have 
> little chance of
> being merged :-).

Nah, it would lead to Jorn disappearing misteriously and _Pavel_
accused of murder ;-).

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] LogFS take three

2007-05-17 Thread David Woodhouse
On Thu, 2007-05-17 at 17:20 +0900, Dongjun Shin wrote:
> There are, of course, cases where direct access are better.
> However, as the demand for capacity, reliability and high performance
> for the flash storage increases, the use of FTL with embedded controller
> would be inevitable.
> 
> - The complexity/cost of host-side SW (like JFFS2/MTD) will increase due to
> the use of multiple flash in parallel and the use of high degree ECC 
> algorithm.
> There are other things like bad block handling and wear-leveling issues
> which has been recently touched by UBI with added SW complexity.

You don't get rid of that complexity by offloading it to a µcontroller. 

The only thing you achieve that way is making sure it's quite likely to
be done badly, and making it impossible to debug.

> - In contrast to the embedded environment where CPU and flash is directly
> connected, the I/O path between CPU and flash in PC environment is longer.
> The latency for SW handshaking between CPU and flash will also be longer,
> which would make the performance optimization harder.

Do it the naïve way with a single byte push/pull and waggling the
control lines separately, and what you say is true -- but you can have
flash controllers which assist with data transfer but still give you
essentially 'raw' access to the chip.

With the CAFÉ controller designed for the OLPC machine, we can spew data
across the PCI bus just as fast as we can suck it off the flash chip.

> As I mentioned, some techniques like log-structured filesystem could
> perform generally better on any kind of flash-based storage with FTL.
> Although there are many kinds of FTL, it is commonly true that
> it performs well under workload where sequential write is dominant.

Yes, it's certainly possible that we _could_ write a file system which
is specifically targeted at FTL -- I was just wondering why anyone would
_bother_ :)

I've seen an interesting file system which does have a kind of FTL
internally as its lowest layer, and which build on that using 'virtual'
sectors for file extents. Now that _does_ have its advantages -- but it
doesn't go as far as pretending to be a 'normal' block device; it's its
own special thing for internal use within that file system.
 
> I also expect that FTL for PC environment will have better quality spec
> than the disposable storage.

There really is no reason why FTL has to be done badly; just as there's
no _reason_ why hardware vendors have to give us crappy bsVendorCode.
Nevertheless, that's the way the world tends to be. So good luck
shipping with that :)

-- 
dwmw2

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] LogFS take three

2007-05-17 Thread Dongjun Shin

On 5/17/07, David Woodhouse <[EMAIL PROTECTED]> wrote:


Yes. These things are almost always implemented _very_ badly by the same
kind of crack-smoking hobo they drag in off the streets to write BIOSen.

It's bog-roll technology; if you fancy a laugh try doing some real
reliability tests on them time some. Powerfail testing is a good one.

This kind of thing is OK for disposable storage such as in digital
cameras, where it doesn't matter that it's no more reliable than a
floppy disc, but for real long-term storage it's really a bad idea.



There are so many flash-based storage and some disposable storages,
as you pointed out, have poor quality. I think it's mainly because these
are not designed for good quality, but for lowering the price.

These kind of devices are not ready for things like power failure because
their use case is far from that. For example, removing flash card
while taking pictures using digital camera is not a common use case.
(there should be a written notice that this kind of action is against
the warranty)



There's little point in optimising a file system _specifically_ for
devices which in often aren't reliable enough to keep your data anyway.
You might as well use ramfs.

It's unfortunate really -- there's no _fundamental_ reason why FTL has
to be done so badly; it's just that it almost always is. Direct access
to the flash from Linux is _always_ going to be better in practice --
and that way you avoid the problems with dual journalling, along with
the problems with the underlying FTL continuing to keep (and copy around
during GC) sectors which the top-level filesystem has actually
deallocated, etc.



There are, of course, cases where direct access are better.
However, as the demand for capacity, reliability and high performance
for the flash storage increases, the use of FTL with embedded controller
would be inevitable.

- The complexity/cost of host-side SW (like JFFS2/MTD) will increase due to
the use of multiple flash in parallel and the use of high degree ECC algorithm.
There are other things like bad block handling and wear-leveling issues
which has been recently touched by UBI with added SW complexity.

- In contrast to the embedded environment where CPU and flash is directly
connected, the I/O path between CPU and flash in PC environment is longer.
The latency for SW handshaking between CPU and flash will also be longer,
which would make the performance optimization harder.

As I mentioned, some techniques like log-structured filesystem could
perform generally better on any kind of flash-based storage with FTL.
Although there are many kinds of FTL, it is commonly true that
it performs well under workload where sequential write is dominant.

I also expect that FTL for PC environment will have better quality spec
than the disposable storage.

Dongjun
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: Software raid0 will crash the file-system, when each disk is 5TB

2007-05-17 Thread Jan Engelhardt

On May 17 2007 09:42, Jeff Zheng wrote:
>
>Problem is that is only happens when you actually write data to the
>raid. You need the actual space to reproduce the problem.

That should not be a big problem. Create like 4x950G virtual sparse
drives (takes roughly or so 4x100 MB on the host after mkfs), then
set up a raid inside the vm, with which you can play - even write
to it. Not sure how RAID5 will affect the host size of the virtual
drives, since during first resync when all disks are blank, it will
XOR it (0^0=1), and hence fills up the host disk.


Jan
-- 
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html