Re: Proposal for "proper" durable fsync() and fdatasync()

2008-02-26 Thread Jeff Garzik

Jamie Lokier wrote:

Jeff Garzik wrote:

Nick Piggin wrote:

Anyway, the idea of making fsync/fdatasync etc. safe by default is
a good idea IMO, and is a bad bug that we don't do that :(

Agreed...  it's also disappointing that [unless I'm mistaken] you have
to hack each filesystem to support barriers.


It seems far easier to make sync_blkdev() Do The Right Thing, and 
magically make all filesystems data-safe.


Well, you need ordered metadata writes, barriers _and_ flushes with
some filesystems.

Merely writing all the data pages then issuing a drive cache flush
won't Do The Right Thing with those filesystems - someone already
mentioned Btrfs, where it won't.


Oh certainly.  That's why we have a VFS :)  fsync for NFS will look 
quite different, too.




But I agree that your suggestion would make a superb default, for
filesystems which don't provide their own function.


Yep.  That would immediately cover a bunch of filesystems.



It's not optimal even then.

  Devices: On a software RAID, you ideally don't want to issue flushes
  to all drives if your database did a 1 block commit entry.  (But they
  probably use O_DIRECT anyway, changing the rules again).  But all that
  can be optimised in generic VFS code eventually.  It doesn't need
  filesystem assistance in most cases.


My own idea is that we create a FLUSH command for blkdev request queues,
to exist alongside READ, WRITE, and the current barrier implementation.
Then FLUSH could be passed down through MD or DM.
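
As a rough illustration of a FLUSH request travelling down a stack of block devices, the toy model below forwards a flush from an MD/DM-like parent to each of its children. Every name in it (`toy_bdev`, `TOY_REQ_FLUSH`, `toy_submit`) is hypothetical and not a kernel API; it only sketches the propagation, not real request-queue handling.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical request types; the real block layer is far richer. */
enum toy_req { TOY_REQ_READ, TOY_REQ_WRITE, TOY_REQ_FLUSH };

/* A toy block device: either a leaf (a drive) or a stack of children. */
struct toy_bdev {
    struct toy_bdev **children;  /* NULL for a leaf device */
    size_t nr_children;          /* 0 for a leaf device */
    unsigned flushes_seen;       /* FLUSH requests that reached this device */
};

/*
 * Submit a request.  A stacked device forwards FLUSH to every child; a leaf
 * records it, standing in for a drive cache flush command.
 * Returns the number of leaf devices that were flushed.
 */
static unsigned toy_submit(struct toy_bdev *dev, enum toy_req req)
{
    unsigned flushed = 0;

    assert(dev != NULL);
    if (req != TOY_REQ_FLUSH)
        return 0;                /* READ/WRITE handling is out of scope here */

    dev->flushes_seen++;
    if (dev->nr_children == 0)
        return 1;

    for (size_t i = 0; i < dev->nr_children; i++)
        flushed += toy_submit(dev->children[i], req);
    return flushed;
}
```

This sketch flushes every child unconditionally. A smarter stacking layer could forward the flush only to children that actually hold unflushed writes, which is the software-RAID optimisation raised earlier in the thread.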


Jeff


-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Proposal for "proper" durable fsync() and fdatasync()

2008-02-26 Thread Jeff Garzik

Nick Piggin wrote:

Anyway, the idea of making fsync/fdatasync etc. safe by default is
a good idea IMO, and is a bad bug that we don't do that :(


Agreed...  it's also disappointing that [unless I'm mistaken] you have 
to hack each filesystem to support barriers.


It seems far easier to make sync_blkdev() Do The Right Thing, and 
magically make all filesystems data-safe.


Jeff




Re: Proposal for "proper" durable fsync() and fdatasync()

2008-02-25 Thread Jeff Garzik

Jamie Lokier wrote:

By durable, I mean that fsync() should actually commit writes to
physical stable storage,


Yes, it should.



I was surprised that fsync() doesn't do this already.  There was a lot
of effort put into block I/O write barriers during 2.5, so that
journalling filesystems can force correct write ordering, using disk
flush cache commands.

After all that effort, I was very surprised to notice that Linux 2.6.x
doesn't use that capability to ensure fsync() flushes the disk cache
onto stable storage.


It's surprising you are surprised, given that this [lame] fsync behavior
has remained consistently lame throughout Linux's history.


[snip huge long proposal]

Rather than invent new APIs, we should fix the existing ones to _really_ 
flush data to physical media.


Linux should default to SAFE data storage, and permit users to retain 
the older unsafe behavior via an option.  It's completely ridiculous 
that we default to an unsafe fsync.


And [anticipating a common response from others] it is completely 
irrelevant that POSIX fsync(2) permits Linux's current behavior.  The 
current behavior is unsafe.


Safety before performance -- ESPECIALLY when it comes to storing user data.
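
For context, this is the only portable durability request an application can make. The sketch below is a minimal userspace example, assuming a POSIX system; the helper name `write_and_fsync` is made up for illustration. Whether the data has really reached the platter when fsync() returns depends on the kernel, filesystem and drive cache behavior discussed in this thread, and the code cannot verify that.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Write buf to path, then ask the kernel to push it to stable storage.
 * Returns 0 on success, -1 on failure (errno is set by the failing call).
 */
static int write_and_fsync(const char *path, const void *buf, size_t len)
{
    const char *p = buf;
    int fd, saved;

    assert(path != NULL && (buf != NULL || len == 0));
    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;        /* interrupted before writing; retry */
            goto fail;
        }
        p += n;
        len -= (size_t)n;
    }

    /* An error here means the data may not be durable; never ignore it. */
    if (fsync(fd) < 0)
        goto fail;

    return close(fd);

fail:
    saved = errno;               /* keep the original error across close() */
    close(fd);
    errno = saved;
    return -1;
}
```

A database-style caller would treat a 0 return as "the commit record is safe", which is exactly the assumption that breaks if fsync() stops at the drive's volatile write cache.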

Regards,

Jeff (Linux ATA driver dude)




Re: BTRFS partition usage...

2008-02-12 Thread Jeff Garzik

David Miller wrote:

From: Chris Mason <[EMAIL PROTECTED]>
Date: Tue, 12 Feb 2008 09:08:59 -0500


I've had requests to move the super down to 64k to make room for
bootloaders, which may not matter for sparc, but I don't really plan
on different locations for different arches.


The Sun disk label sits in the first 512 bytes and the boot loader
block sits in the second 512 bytes.

I think leaving even more space is a good idea for several reasons.



Yep.  I chose 32K unused space in the prototype filesystem I wrote [1, 
2.4 era].  I'm pretty sure I got that number from some other filesystem, 
maybe even some NTFS incarnation.  It's just good practice to avoid the 
first and last "chunks" of a partition, FSVO chunk.


Jeff


[1] http://kernel.org/pub/linux/kernel/people/jgarzik/ibu/


Re: [PATCH 1/2] Make cramfs little endian only

2007-12-04 Thread Jeff Garzik

Linus Torvalds wrote:


On Tue, 4 Dec 2007, Andi Drebes wrote:

Perhaps I'm missing something, but I think for cramfs, unfortunately,
there has to be this statement. The bitfields in the cramfs_inode structure
cause some problems.


I agree that bitfields can be painful, but they should likely be just 
rewritten to be accesses using actual masks and shifts. The thing is, 
bitfields aren't actually endianness safe *anyway*, in that a compiler may 
end up using a *different* bit order than the byte order.  So you cannot 
really use bitfields reliably on things like that (although Linux has a 
notion of a "__[BIG|LITTLE]_ENDIAN_BITFIELD", if you really want to).
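
A minimal sketch of the "masks and shifts" approach, assuming a purely illustrative on-disk word whose low 6 bits hold a name length and whose high 26 bits hold an offset. The layout and all `demo_*` names are hypothetical and not taken from any real header. The bytes are decoded explicitly, so the result does not depend on host byte order or on how the compiler orders bitfields.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: bits 0..5 = name length, bits 6..31 = offset. */
#define DEMO_NAMELEN_BITS  6u
#define DEMO_NAMELEN_MASK  ((1u << DEMO_NAMELEN_BITS) - 1u)
#define DEMO_OFFSET_BITS   26u

/* Decode a 32-bit little-endian on-disk value on any host. */
static uint32_t demo_get_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Extract the low field with a mask. */
static uint32_t demo_namelen(uint32_t word)
{
    return word & DEMO_NAMELEN_MASK;
}

/* Extract the high field with a shift. */
static uint32_t demo_offset(uint32_t word)
{
    return word >> DEMO_NAMELEN_BITS;
}

/* Combine both fields into one word; callers must pass in-range values. */
static uint32_t demo_pack(uint32_t namelen, uint32_t offset)
{
    assert(namelen <= DEMO_NAMELEN_MASK);
    assert(offset < (1u << DEMO_OFFSET_BITS));
    return (offset << DEMO_NAMELEN_BITS) | namelen;
}
```

The field positions are spelled out in the macros, so the same code reads the same bits on big-endian and little-endian machines.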


Bitfields also generate lower-quality assembly than masks&shifts 
(typically more instructions using additional temporaries to accomplish 
the same thing), based on my own informal gcc testing.


You would think gcc would be advanced enough to turn bitfield use into 
masks and shifts under the hood, but for whatever reason that often is 
not the case in kernel code.


Due to the way they're used, bitfields make more difficult the common 
code pattern of setting several flags at once:


(assuming 'foo', 'bar' and 'baz' are bitfields in a struct)
pdev->foo = 1;
pdev->bar = 0;
pdev->baz = 1;

versus

flag_foo = (1 << 0);
flag_bar = (1 << 1);
flag_baz = (1 << 2);
...
pdev->flags = flag_foo | flag_baz;



And getting back on topic, I think "pdev->flags = 
cpu_to_le32(flag1|flag2)" is nicer than dealing with bitfields, when 
your data structures are fixed-endian.
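
A small userspace sketch of that pattern, assuming hypothetical flag names and a byte-wise helper standing in for the kernel's cpu_to_le32() (which is not available outside the kernel). Several flags are set and cleared in one expression, then stored in fixed little-endian form.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bits, mirroring foo/bar/baz from the example above. */
enum {
    DEMO_FLAG_FOO = 1u << 0,
    DEMO_FLAG_BAR = 1u << 1,
    DEMO_FLAG_BAZ = 1u << 2,
};

/* Userspace stand-in for cpu_to_le32(): store v as little-endian bytes. */
static void demo_put_le32(uint8_t out[4], uint32_t v)
{
    assert(out != NULL);
    out[0] = (uint8_t)(v & 0xff);
    out[1] = (uint8_t)((v >> 8) & 0xff);
    out[2] = (uint8_t)((v >> 16) & 0xff);
    out[3] = (uint8_t)((v >> 24) & 0xff);
}

/* Set and clear several flags in a single read-modify-write. */
static uint32_t demo_update_flags(uint32_t flags, uint32_t set, uint32_t clear)
{
    return (flags & ~clear) | set;
}
```

With bitfields the same update would take one assignment per field; here it is a single expression, and the stored bytes are identical on every architecture.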


Jeff




Re: Distributed storage. Move away from char device ioctls.

2007-09-15 Thread Jeff Garzik

Robin Humble wrote:

On Fri, Sep 14, 2007 at 03:07:46PM -0400, Jeff Garzik wrote:
It is my hope that you will put your skills towards a distributed 
filesystem :)  Of the current solutions, GFS (currently in kernel) 
scales poorly, and NFS v4.1 is amazingly bloated and overly complex.


I've been waiting for years for a smart person to come along and write a 
POSIX-only distributed filesystem.


it's called Lustre.
works well, scales well, is widely used, is GPL.
sadly it's not in mainline.


Lustre is tilted far too much towards high-priced storage, and needs 
improvement before it could be considered for mainline.


Jeff





Re: Distributed storage. Move away from char device ioctls.

2007-09-14 Thread Jeff Garzik

J. Bruce Fields wrote:

On Fri, Sep 14, 2007 at 06:32:11PM -0400, Jeff Garzik wrote:

J. Bruce Fields wrote:

On Fri, Sep 14, 2007 at 05:14:53PM -0400, Jeff Garzik wrote:
NFSv4.1 adds to the fun, by throwing interoperability completely out the 
window.



What parts are you worried about in particular?



I'm not worried; I'm stating facts as they exist today (draft 13):

NFS v4.1 does something completely without precedent in the history of NFS:
the specification is defined such that interoperability is -impossible- to
guarantee.


pNFS permits private and unspecified layout types.  This means it is 
impossible to guarantee that one NFSv4.1 implementation will be able to 
talk to another NFSv4.1 implementation.



No, servers are required to support ordinary nfs operations to the
metadata server.

At least, that's the way it was last I heard, which was a while ago.  I
agree that it'd stink (for any number of reasons) if you ever *had* to
get a layout to access some file.

Was that your main concern?


I just sorta assumed you could fall back to the NFSv4.0 mode of 
operation, going through the metadata server for all data accesses.


But look at that choice in practice:  you can either ditch pNFS 
completely, or use a proprietary solution.  The market incentives are 
CLEARLY tilted in favor of makers of proprietary solutions.  But it's a 
poor choice (really little choice at all).


Overall, my main concern is that NFSv4.1 is no longer an open 
architecture solution.  The "no-pNFS or proprietary platform" choice
merely illustrates one of many negative aspects of this architecture.


One of NFS's biggest value propositions is its interoperability.  To 
quote some Wall Street guys, "NFS is like crack.  It Just Works.  We 
love it."


Now, for the first time in NFS's history (AFAIK), the protocol is no 
longer completely specified, completely known.  No longer a "closed 
loop."  Private layout types mean that it is _highly_ unlikely that any 
OS or appliance or implementation will be able to claim "full NFS 
compatibility."


And when the proprietary portion of the spec involves something as basic 
as accessing one's own data, I consider that a fundamental flaw.  NFS is 
no longer completely open.


Jeff





Re: Distributed storage. Move away from char device ioctls.

2007-09-14 Thread Jeff Garzik

J. Bruce Fields wrote:

On Fri, Sep 14, 2007 at 05:14:53PM -0400, Jeff Garzik wrote:

J. Bruce Fields wrote:

On Fri, Sep 14, 2007 at 03:07:46PM -0400, Jeff Garzik wrote:
I've been waiting for years for a smart person to come along and write a 
POSIX-only distributed filesystem.

What exactly do you mean by "POSIX-only"?

Don't bother supporting attributes, file modes, and other details not
supported by POSIX.  The prime example being NFSv4, which is larded down 
with Windows features.


I am sympathetic.  Cutting those out may still leave you with
something pretty complicated, though.


Far less complicated than NFSv4.1 though (which is easy :))


NFSv4.1 adds to the fun, by throwing interoperability completely out the 
window.


What parts are you worried about in particular?


I'm not worried; I'm stating facts as they exist today (draft 13):

NFS v4.1 does something completely without precedent in the history of 
NFS:  the specification is defined such that interoperability is 
-impossible- to guarantee.


pNFS permits private and unspecified layout types.  This means it is 
impossible to guarantee that one NFSv4.1 implementation will be able to 
talk to another NFSv4.1 implementation.


Even if Linux supports the entire NFSv4.1 RFC (as it stands in draft 13 
anyway), there is no guarantee at all that Linux will be able to store 
and retrieve data, since it's entirely possible that a proprietary 
protocol is required to access your data.


NFSv4.1 is no longer a completely open architecture.

Jeff






Re: Distributed storage. Move away from char device ioctls.

2007-09-14 Thread Jeff Garzik

J. Bruce Fields wrote:

On Fri, Sep 14, 2007 at 03:07:46PM -0400, Jeff Garzik wrote:
I've been waiting for years for a smart person to come along and write a 
POSIX-only distributed filesystem.



What exactly do you mean by "POSIX-only"?


Don't bother supporting attributes, file modes, and other details not 
supported by POSIX.  The prime example being NFSv4, which is larded down 
with Windows features.


NFSv4.1 adds to the fun, by throwing interoperability completely out the 
window.


Jeff





Re: Distributed storage. Move away from char device ioctls.

2007-09-14 Thread Jeff Garzik

Evgeniy Polyakov wrote:

Hi.

I'm pleased to announce the fourth release of the distributed storage
subsystem, which allows one to form a storage on top of remote and local
nodes, which in turn can be exported to another storage as a node to
form tree-like storages.

This release includes a new configuration interface (kernel connector over
netlink socket) and a number of fixes for various bugs found during the
move to it (in error path).


Further TODO list includes:
* implement optional saving of mirroring/linear information on the remote
nodes (simple)
* new redundancy algorithm (complex)
* some thoughts about distributed filesystem tightly connected to DST
(far-far plans so far)

Homepage:
http://tservice.net.ru/~s0mbre/old/?section=projects&item=dst

Signed-off-by: Evgeniy Polyakov <[EMAIL PROTECTED]>


My thoughts.  But first a disclaimer:   Perhaps you will recall me as 
one of the people who really reads all your patches, and examines your 
code and proposals closely.  So, with that in mind...


I question the value of distributed block services (DBS), whether it's
your version or the others out there.  DBS are not very useful, because
they still rely on a useful filesystem sitting on top of the DBS.  It
devolves into one of two cases:  (1) multi-path much like today's SCSI,
with distributed filesystem arbitration to ensure coherency, or (2) the
filesystem running on top of the DBS is on a single host, and thus, a
single point of failure (SPOF).


It is quite logical to extend the concepts of RAID across the network, 
but ultimately you are still bound by the inflexibility and simplicity 
of the block device.


In contrast, a distributed filesystem offers far more scalability, 
eliminates single points of failure, and offers more room for 
optimization and redundancy across the cluster.


A distributed filesystem is also much more complex, which is why 
distributed block devices are so appealing :)


With a redundant, distributed filesystem, you simply do not need any 
complexity at all at the block device level.  You don't even need RAID.


It is my hope that you will put your skills towards a distributed 
filesystem :)  Of the current solutions, GFS (currently in kernel) 
scales poorly, and NFS v4.1 is amazingly bloated and overly complex.


I've been waiting for years for a smart person to come along and write a 
POSIX-only distributed filesystem.


Jeff





Re: [RFC 18/26] FS: ExtX filesystem defrag

2007-09-01 Thread Jeff Garzik
Please add 'slab' to the title, otherwise you conflict with a feature of 
the same name...





Re: [RFC] basic delayed allocation in VFS

2007-07-27 Thread Jeff Garzik

Alex Tomas wrote:

So without the ability to attach specific I/O completions to bios
or support for unwritten extents directly in __mpage_writepage,
there is no way XFS can use this "generic" delayed allocation code.


I didn't say "generic", see Subject: :)


Well, it shouldn't even be in the VFS layer if it's only usable by one 
filesystem.


Jeff




Re: [RFC] basic delayed allocation in VFS

2007-07-26 Thread Jeff Garzik

Alex Tomas wrote:

Jeff Garzik wrote:

Is this based on Christoph's work?

Christoph, or some other XFS hacker, already did generic delalloc, 
modeled on the XFS delalloc code.


nope, this one is simple (something I'd prefer for ext4).


The XFS one is proven and the work was already completed.

What were the specific technical issues that made it unsuitable for ext4?

I would rather not reinvent the wheel, particularly if the reinvention 
is less capable than the existing work.


Jeff





Re: [RFC] basic delayed allocation in VFS

2007-07-26 Thread Jeff Garzik

Alex Tomas wrote:

Good day,

please review ...

thanks, Alex


basic delayed allocation in VFS:

 * block_prepare_write() can be passed a special ->get_block() which
   doesn't allocate blocks, but reserves them and marks the bh delayed
 * a filesystem can use mpage_da_writepages() with another ->get_block()
   which doesn't defer allocation. mpage_da_writepages() finds all
   non-allocated blocks and tries to allocate them with minimal calls
   to ->get_block(), then submits IO using __mpage_writepage()
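
The two-phase scheme in the list above can be sketched with a toy model. This is an assumption-laden illustration, not the patch itself: the `demo_*` names, the block-state array and the free-space counter are all invented here. The write path only reserves space and marks a block delayed; the writeout path then maps every delayed block, making one allocator call per contiguous run.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-block state for one file. */
enum demo_bstate { DEMO_HOLE, DEMO_DELAYED, DEMO_MAPPED };

/*
 * Write path: reserve space and mark the block delayed instead of
 * allocating it.  Returns 0 on success, -1 if no space can be reserved
 * (the point at which a real filesystem would report ENOSPC).
 */
static int demo_reserve(enum demo_bstate *blocks, size_t nr, size_t idx,
                        size_t *free_blocks)
{
    assert(blocks != NULL && free_blocks != NULL && idx < nr);
    if (blocks[idx] != DEMO_HOLE)
        return 0;                /* already reserved or mapped */
    if (*free_blocks == 0)
        return -1;
    (*free_blocks)--;
    blocks[idx] = DEMO_DELAYED;
    return 0;
}

/*
 * Writeout path: map every delayed block, issuing one allocation call per
 * contiguous run.  Returns the number of allocation calls made.
 */
static size_t demo_writepages(enum demo_bstate *blocks, size_t nr)
{
    size_t calls = 0, i = 0;

    assert(blocks != NULL);
    while (i < nr) {
        if (blocks[i] != DEMO_DELAYED) {
            i++;
            continue;
        }
        calls++;                 /* one allocator call covers the whole run */
        while (i < nr && blocks[i] == DEMO_DELAYED)
            blocks[i++] = DEMO_MAPPED;
    }
    return calls;
}
```

Reserving at write time is what lets the filesystem refuse a write early, while batching at writeout is what keeps the number of allocator calls small.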


Signed-off-by: Alex Tomas <[EMAIL PROTECTED]>


Is this based on Christoph's work?

Christoph, or some other XFS hacker, already did generic delalloc, 
modeled on the XFS delalloc code.


Jeff




new ext4 build warnings

2007-07-18 Thread Jeff Garzik

It seems jbd_debug() might need modification:

fs/ext4/inode.c: In function ‘ext4_write_inode’:
fs/ext4/inode.c:2906: warning: comparison is always true due to limited 
range of data type


fs/jbd2/recovery.c: In function ‘jbd2_journal_recover’:
fs/jbd2/recovery.c:254: warning: comparison is always true due to 
limited range of data type
fs/jbd2/recovery.c:257: warning: comparison is always true due to 
limited range of data type


fs/jbd2/recovery.c: In function ‘jbd2_journal_skip_recovery’:
fs/jbd2/recovery.c:301: warning: comparison is always true due to 
limited range of data type
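
One plausible way to get this class of warning, offered as a guess rather than a diagnosis: if a debug macro expands to a check of the form `(n) <= level`, where `level` is an unsigned 8-bit variable and `n` is the constant 0, the comparison can never be false. The sketch below uses hypothetical names and passes `n` as a run-time value, so it compiles quietly while showing the same logic.

```c
#include <stdint.h>

/* Hypothetical stand-in for a debug-level check of the form (n) <= level. */
static int demo_debug_enabled(unsigned n, uint8_t level)
{
    return n <= level;
}

/* Count how many of the 256 possible levels enable messages of priority n. */
static unsigned demo_levels_enabling(unsigned n)
{
    unsigned count = 0;

    for (unsigned level = 0; level <= UINT8_MAX; level++)
        count += (unsigned)demo_debug_enabled(n, (uint8_t)level);
    return count;
}
```

For `n == 0` every possible level passes the check, which is what "comparison is always true due to limited range of data type" is pointing at when the compiler can see the constant.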


I'm surprised this was not noticed in a test build before pushing upstream.

Jeff




Re: *at syscalls for xattrs?

2007-07-16 Thread Jeff Garzik

H. Peter Anvin wrote:

Jeff Garzik wrote:

What the *at() interfaces really do is fix/paper over a longstanding
wart in Unix: the cwd really should have been a standard file descriptor
(like stdin/stdout/stderr) instead of a magic piece of state maintained
in kernel space.

It's more than a wart, IMO.  *at() allows one to close races (with
potential security implications) that are otherwise impossible to close,
in directory traversal.

*at() permits a userspace program to hold proper references to all
objects during a directory traversal, with all that implies.



Well, as Jeremy pointed out, in the absence of threads you can do the
same thing with fchdir(), however, that's much more of a hack.


My posixutils project (coreutils replacement) used fchdir(2), but that 
still doesn't get you 100% race-free.  It gets you close, yes.


Jeff





Re: *at syscalls for xattrs?

2007-07-16 Thread Jeff Garzik

H. Peter Anvin wrote:

Miklos Szeredi wrote:

The *at() thing basically gives you the advantages of a CWD without
the disadvantages.

For example it could be useful to implement the functionality of
find(1) as a library interface.



What the *at() interfaces really do is fix/paper over a longstanding
wart in Unix: the cwd really should have been a standard file descriptor
(like stdin/stdout/stderr) instead of a magic piece of state maintained
in kernel space.


It's more than a wart, IMO.  *at() allows one to close races (with 
potential security implications) that are otherwise impossible to close, 
in directory traversal.


*at() permits a userspace program to hold proper references to all 
objects during a directory traversal, with all that implies.
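
A minimal sketch of what holding a reference looks like in practice, assuming a POSIX.1-2008 system; the helper name `demo_size_at` is made up for illustration. The lookup starts from an already-open directory descriptor rather than from a path string or the process-wide cwd.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Open "name" relative to an already-open directory and report its size.
 * Because the lookup is anchored at dirfd, renaming or replacing a parent
 * directory in the meantime cannot redirect it.  O_NOFOLLOW makes the open
 * fail if the final component is a symlink.
 * Returns 0 on success (size stored in *size), -1 on failure.
 */
static int demo_size_at(int dirfd, const char *name, off_t *size)
{
    struct stat st;
    int fd;

    assert(name != NULL && size != NULL);
    fd = openat(dirfd, name, O_RDONLY | O_NOFOLLOW);
    if (fd < 0)
        return -1;
    if (fstat(fd, &st) < 0) {    /* stat the object we hold, not a path */
        close(fd);
        return -1;
    }
    *size = st.st_size;
    return close(fd);
}
```

A recursive traversal would keep one such descriptor per directory level and pass it down, so each step operates on the directory it actually opened. Unlike an fchdir()-based approach, this does not touch shared per-process state, so it also works with threads.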


Jeff





Re: [ANNOUNCE] util-linux-ng 2.13-rc1

2007-07-06 Thread Jeff Garzik

Gerd Hoffmann wrote:

Jeff Garzik wrote:

Christoph Hellwig wrote:

And this is really dumb.  autotools is a complete pain in the ass and
not useful at all for linux-only tools.

A myth.  It is quite useful for packagers, because of the high Just
Works(tm) factor.  After porting an entire distro across several
revisions, the autotools-based packages are the ones that work out of
the box 90% of the time.


And the 10% where it doesn't work it is a real pain to figure out what
goes wrong due to the completely unreadable Makefiles generated by
autotools.  After all they are not Makefiles, they are shellscripts
embedded into Makefiles.


The other 90% of _my_ time comes from annoying people who roll their own
Makefile/build solution, which the packager has to then learn.


Well, it's not *that* hard to write makefiles which follow the usual
gnuish conventions, so stuff like "make DESTDIR=/tmp/buildroot install"
works just fine.  That isn't a reason to use autotools.  Especially as
people get that wrong *even with* autotools from time to time ...


It's not _just_ makefiles, though.  Packaging systems know what to do 
with configure scripts, and automatically plug that into their systems, 
e.g. with rpm's %configure, %make_install, etc.


Having ported an entire distro, the time savings with autotools [OR 
ANOTHER STANDARD BUILD/CONFIGURE SYSTEM] are very real.  Similarly, the 
time sink with each project doing its own home-rolled build/configure 
system is also very real.


Jeff





Re: [ANNOUNCE] util-linux-ng 2.13-rc1

2007-07-05 Thread Jeff Garzik

Christoph Hellwig wrote:

On Wed, Jul 04, 2007 at 12:11:56AM +0200, Karel Zak wrote:

 The package build system is now based on autotools. The build system
 supports  separate CFLAGS and LDFLAGS for suid programs (SUID_CFLAGS,
 SUID_LDFLAGS). For more details see the README file


And this is really dumb.  autotools is a complete pain in the ass and
not useful at all for linux-only tools.



A myth.  It is quite useful for packagers, because of the high Just
Works(tm) factor.  After porting an entire distro across several
revisions, the autotools-based packages are the ones that work out of
the box 90% of the time.


The other 90% of _my_ time comes from annoying people who roll their own 
Makefile/build solution, which the packager has to then learn.


It's just not scalable for people to keep building their own build 
solutions.


Jeff




Re: [RFC] fsblock

2007-06-30 Thread Jeff Garzik

Christoph Hellwig wrote:

On Sat, Jun 23, 2007 at 11:07:54PM -0400, Jeff Garzik wrote:

- In line with the above item, filesystem block allocation is performed
 before a page is dirtied. In the buffer layer, mmap writes can dirty a
 page with no backing blocks which is a problem if the filesystem is
 ENOSPC (patches exist for buffer.c for this).

This raises an eyebrow...  The handling of ENOSPC prior to mmap write is
more an ABI behavior, so I don't see how this can be fixed with internal 
changes, yet without changing behavior currently exported to userland 
(and thus affecting code based on such assumptions).



Not really, the current behaviour is a bug.  And it's not actually buffer
layer specific - XFS now has a fix for that bug and it's generic enough
that everyone could use it.


I'm not sure I follow.  If you require block allocation at mmap(2) time, 
rather than when a page is actually dirtied, you are denying userspace 
the ability to do sparse files with mmap.


A quick Google readily turns up people who have built upon the 
mmap-sparse-file assumption, and I don't think we want to break those 
assumptions as a "bug fix."
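
A minimal sketch of the usage pattern in question, assuming a POSIX system; the helper name `demo_sparse_mmap_write` is invented for illustration. The file is extended with a hole, mapped shared, and a single byte is written through the mapping.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Create a sparse file of `size` bytes and dirty one byte at `offset`
 * through a shared mapping.  ftruncate() only sets the file size, so the
 * file is one large hole when mmap() is called; only the touched page
 * needs backing blocks afterwards.
 * Returns 0 on success, -1 on failure.
 */
static int demo_sparse_mmap_write(const char *path, off_t size, off_t offset,
                                  unsigned char value)
{
    unsigned char *map;
    int fd, rc = -1;

    assert(path != NULL && size > 0 && offset >= 0 && offset < size);
    fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, size) < 0)          /* extends the file with a hole */
        goto out;

    map = mmap(NULL, (size_t)size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED)
        goto out;

    map[offset] = value;                  /* dirties exactly one page */
    if (msync(map, (size_t)size, MS_SYNC) == 0)
        rc = 0;
    if (munmap(map, (size_t)size) < 0)
        rc = -1;
out:
    close(fd);
    return rc;
}
```

Programs written this way rely on mmap() of a hole succeeding without allocating the whole range. Allocating when a page is first dirtied keeps that working. Allocating at mmap(2) time would not.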


Where is the bug?

Jeff




Re: [PATCH 0/6][TAKE5] fallocate system call

2007-06-29 Thread Jeff Garzik

Theodore Tso wrote:

I don't think we have a problem here.  What we have now is fine, and


It's fine for ext4, but not the wider world.  This is a common problem 
created by parallel development when code dependencies exist.




In any case, the plan is to push all of the core bits into Linus tree
for 2.6.22 once it opens up, which should be Real Soon Now, it looks
like.


Presumably you mean 2.6.23.

Jeff





Re: [PATCH 0/6][TAKE5] fallocate system call

2007-06-28 Thread Jeff Garzik

Andrew Morton wrote:

b) We do what we normally don't do and reserve the syscall slots in mainline.


If everyone agrees it's going to happen... why not?

Jeff




Re: [RFC] fsblock

2007-06-23 Thread Jeff Garzik

Nick Piggin wrote:

- No deadlocks (hopefully). The buffer layer is technically deadlocky by
  design, because it can require memory allocations at page writeout-time.
  It also has one path that cannot tolerate memory allocation failures.
  No such problems for fsblock, which keeps fsblock metadata around for as
  long as a page is dirty (this still has problems vs get_user_pages, but
  that's going to require an audit of all get_user_pages sites. Phew).

- In line with the above item, filesystem block allocation is performed
  before a page is dirtied. In the buffer layer, mmap writes can dirty a
  page with no backing blocks which is a problem if the filesystem is
  ENOSPC (patches exist for buffer.c for this).


This raises an eyebrow...  The handling of ENOSPC prior to mmap write is 
more an ABI behavior, so I don't see how this can be fixed with internal 
changes, yet without changing behavior currently exported to userland 
(and thus affecting code based on such assumptions).




- An inode's metadata must be tracked per-inode in order for fsync to
  work correctly. buffer contains helpers to do this for basic
  filesystems, but any block can be only the metadata for a single inode.
  This is not really correct for things like inode descriptor blocks.
  fsblock can track multiple inodes per block. (This is non trivial,
  and it may be overkill so it could be reverted to a simpler scheme
  like buffer).


hrm; no specific comment but this seems like an idea/area that needs to 
be fleshed out more, by converting some of the more advanced filesystems.




- Large block support. I can mount and run an 8K block size minix3 fs on
  my 4K page system and it didn't require anything special in the fs. We
  can go up to about 32MB blocks now, and gigabyte+ blocks would only
  require  one more bit in the fsblock flags. fsblock_superpage blocks
  are > PAGE_CACHE_SIZE, midpage ==, and subpage <.
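
The three size classes named above reduce to a comparison against the page size. A tiny sketch, with hypothetical `demo_*` names that are not taken from the fsblock patches:

```c
#include <assert.h>
#include <stddef.h>

enum demo_block_class { DEMO_SUBPAGE, DEMO_MIDPAGE, DEMO_SUPERPAGE };

/* Classify a filesystem block size relative to the page size. */
static enum demo_block_class demo_classify(size_t block_size, size_t page_size)
{
    assert(block_size > 0 && page_size > 0);
    if (block_size > page_size)
        return DEMO_SUPERPAGE;
    if (block_size == page_size)
        return DEMO_MIDPAGE;
    return DEMO_SUBPAGE;
}

/* Pages needed to hold one block (rounded up, so at least 1). */
static size_t demo_pages_per_block(size_t block_size, size_t page_size)
{
    assert(block_size > 0 && page_size > 0);
    return (block_size + page_size - 1) / page_size;
}
```

The 8K-block minix3 filesystem on a 4K-page machine mentioned above would fall in the superpage class, with two pages per block.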


definitely useful, especially if I rewrite my ibu filesystem for 2.6.x, 
like I've been planning.




So. Comments? Is this something we want? If yes, then how would we
transition from buffer.c to fsblock.c?


Your work is definitely interesting, but I think it will be even more 
interesting once ext2 (w/ dir in pagecache) and ext3 (journalling) are 
converted.


My gut feeling is that there are several problem areas you haven't hit 
yet, with the new code.


Also, once things are converted, the question of transitioning from 
buffer.c will undoubtedly answer itself.  That's the way several of us 
handle transitions:  finish all the work, then look with fresh eyes and 
conceive a path from the current code to your enhanced code.


Jeff




Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Jeff Garzik

Andreas Dilger wrote:

My comment was just that the extent doesn't have to be explicitly zero
filled on the disk, by virtue of the fact that the uninitialized flag
will cause reads to return zero.



Agreed, thanks for the clarification.

Jeff




Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Jeff Garzik

Andreas Dilger wrote:

On May 07, 2007  13:58 -0700, Andrew Morton wrote:

Final point: it's fairly disappointing that the present implementation is
ext4-only, and extent-only.  I do think we should be aiming at an ext4
bitmap-based implementation and an ext3 implementation.


Actually, this is a non-issue.  The reason that it is handled for extent-only
is that this is the only way to allocate space in the filesystem without
doing the explicit zeroing.  For other filesystems (including ext3 and


Precisely /how/ do you avoid the zeroing issue, for extents?

If I posix_fallocate() 20GB on ext4, it damn well better be zeroed, 
otherwise the implementation is broken.
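
The expectation can be checked from userspace. The sketch below assumes a POSIX system and uses a made-up helper name, `demo_fallocate_reads_zero`. It preallocates a range and verifies that every byte reads back as zero, which must hold whether the filesystem writes zeros to disk or only marks the extent as uninitialized.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Preallocate `len` bytes for `path` with posix_fallocate() and verify that
 * the whole range reads back as zero.
 * Returns 1 if all bytes are zero, 0 if a non-zero byte was found,
 * -1 on any I/O or allocation error.
 */
static int demo_fallocate_reads_zero(const char *path, off_t len)
{
    unsigned char buf[4096];
    off_t done = 0;
    int fd, result = 1;

    assert(path != NULL && len > 0);
    fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (posix_fallocate(fd, 0, len) != 0) {   /* returns an errno value, not -1 */
        close(fd);
        return -1;
    }

    while (done < len && result == 1) {
        ssize_t n = pread(fd, buf, sizeof buf, done);
        if (n <= 0) {
            result = -1;                      /* error or unexpected EOF */
            break;
        }
        for (ssize_t i = 0; i < n; i++)
            if (buf[i] != 0)
                result = 0;                   /* stale data leaked through */
        done += n;
    }
    close(fd);
    return result;
}
```

From the application's side the two implementations are indistinguishable: the only observable contract is that the preallocated range reads as zeros.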


Jeff




Re: REISER4 FOR INCLUSION IN THE LINUX KERNEL.

2007-04-08 Thread Jeff Garzik

[EMAIL PROTECTED] wrote:

YOU GUYS WILL LAUGH ABOUT THIS:


Yes, we are laughing at you.

You keep using bonnie++ after being told it's a poor benchmark.

Jeff





Re: Reiser4. BEST FILESYSTEM EVER.

2007-04-08 Thread Jeff Garzik

David H. Lynch Jr wrote:

Jeff Garzik wrote:

David H. Lynch Jr wrote:

I'm arguing against circular logic:  the claim that one cannot
determine reiser4's true usefulness unless it's in the tree.

The better method is to get a distro to add reiser4, _then_ if it
proves worthy add it to the kernel tree.

Not the other way around. 



And is that how other filesystems made it into the tree ?


In the case of most major filesystems, yes.  Distros are a proving 
ground for new stuff, not the upstream kernel.




I regularly see drivers with very little in the way of testing go
nearly straight into the tree - without even getting tagged as
experimental.


Hardware drivers are vastly different from filesystem drivers.

Jeff





Re: REISER4 FOR INCLUSION IN THE LINUX KERNEL.

2007-04-08 Thread Jeff Garzik

[EMAIL PROTECTED] wrote:

REISER4 FOR INCLUSION IN THE LINUX KERNEL.

Dave Lynch takes a reasoned approach to REISER4.

Dave Lynch wrote:

Jeff Garzik wrote:

If the compelling reason is that it needs a test, I'd say it's not ready.


Can you please elaborate ? I am not sure I understand what you are
arguing ?


Jeff Garzik is "saying" that he wants REISER4 to stay out of the main
kernel, for reasons he is not willing to tell you.


False.  I have told you the reasons.



I for one would at least play with it if it were in the distribution
tree.


I AM SURE THERE ARE A HUGE NUMBER OF PEOPLE WHO WOULD GIVE IT A TRY.


You can download it now.  Nobody is stopping you, or anyone else.


As far as I could tell, pretty much everything else that was
demanded, Hans eventually caved and provided - albeit with much
pissing and moaning, and holier-than-thou rhetoric.


It was not his pissing and moaning, etc,... these were just excuses to
keep REISER4 from succeeding. The truth is, that any excuse would do.

The real reasons are financial and backed by big money (sometimes, big
egos).


Put down the conspiracy crackpipe.

Jeff




Re: Reiser4. BEST FILESYSTEM EVER.

2007-04-08 Thread Jeff Garzik

David H. Lynch Jr wrote:

Jeff Garzik wrote:

If the compelling reason is that it needs a test, I'd say it's not ready.


Can you please elaborate ? I am not sure I understand what you are
arguing ?

Despite his substantially less than polite rhetoric, I have read
Hans's posts from months if not years ago.
Aside from the pissing contests - which were not entirely one
sided, I actually believe that Hans made a reasonable case
that Reiser4 had gone about as far as it could reasonably go with
regard to testing, robustness, ... without the broader base of
use that even an experimental filesystem in the distribution tree would get.

I for one would at least play with it if it were in the distribution
tree.
As far as I could tell, pretty much everything else that was demanded,
Hans eventually caved and provided - albeit with much pissing and moaning,
and holier-than-thou rhetoric.

The argument that anything that needs testing can't get into the
distribution trees is specious. There is a lot of poorly tested crap in
the distribution trees.


I'm arguing against circular logic:  the claim that one cannot determine 
reiser4's true usefulness unless it's in the tree.


The better method is to get a distro to add reiser4, _then_ if it proves 
worthy add it to the kernel tree.


Not the other way around.

Jeff




Re: Reiser4. BEST FILESYSTEM EVER.

2007-04-08 Thread Jeff Garzik

David H. Lynch Jr wrote:

I do care about getting Reiser4 into the kernel so that it can
actually get a real test,
and frankly do not see any compelling reason that should not happen.



If the compelling reason is that it needs a test, I'd say it's not ready.

Jeff




Re: impact of 4k sector size on the IO & FS stack

2007-03-12 Thread Jeff Garzik

Douglas Gilbert wrote:

Bryan Henderson wrote:

What is an odd-aligned disk?


s/disk/partition/ ?



Example:  An odd-aligned disk in the 512-b logical / 1K-physical 
scenario is where odd LBAs indicate the start of a 1K physical sector. 
An even-aligned disk is where even LBAs indicate the start of a 1K 
physical sector.


In order to avoid too many RMW cycles, partition software SHOULD (using 
IETF language) be aware of the underlying physical sector size 
alignment, in order to align partitions for optimal performance.
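
As a rough sketch of what "alignment-aware" partitioning could mean in practice: the helpers below treat a disk as having `logical_per_physical` 512-byte LBAs per physical sector, with `first_aligned_lba` being the lowest LBA that starts a physical sector (0 for an even-aligned disk, 1 for an odd-aligned one in the 1K case). Both function names and parameters are hypothetical, chosen only for this illustration.

```python
def starts_physical_sector(lba: int, logical_per_physical: int, first_aligned_lba: int) -> bool:
    """True if `lba` is the first logical sector of a physical sector."""
    return (lba - first_aligned_lba) % logical_per_physical == 0


def align_up(lba: int, logical_per_physical: int, first_aligned_lba: int) -> int:
    """Smallest LBA >= `lba` that starts a physical sector."""
    remainder = (lba - first_aligned_lba) % logical_per_physical
    if remainder == 0:
        return lba
    return lba + (logical_per_physical - remainder)


if __name__ == "__main__":
    # A classic DOS-style partition start at LBA 63, on 1K physical sectors.
    print(starts_physical_sector(63, 2, first_aligned_lba=1))  # odd-aligned disk
    print(starts_physical_sector(63, 2, first_aligned_lba=0))  # even-aligned disk
    print(align_up(63, 2, first_aligned_lba=0))
```

Partitioning software that knows the two parameters can round each partition start with something like `align_up` before writing the table.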


Jeff




Re: impact of 4k sector size on the IO & FS stack

2007-03-12 Thread Jeff Garzik

Christoph Hellwig wrote:

the occasional 2k sector SCSI MO device as well.  It would be nice to
get samples of large sector size ATA devices into the hands of developers
to do real world testing of the whole stack.


"hands of developers" meaning you specifically?  :)

I've had a 512b-logical/1K-physical ATA test drive for a few months now, 
and another couple arrived today.


Hopefully people can parse what I've been posting, since I cannot give 
out raw numbers or data at this time.


Of course, with RMW drives that leave the 512-b logical interface 
untouched, I had expected that they would Just Work(tm) and that is 
pretty much what happened.


Jeff




Re: impact of 4k sector size on the IO & FS stack

2007-03-12 Thread Jeff Garzik

Jan Engelhardt wrote:

On Mar 11 2007 22:45, Ric Wheeler wrote:

Jan Engelhardt wrote:

On Mar 11 2007 18:51, Ric Wheeler wrote:


During the recent IO/FS workshop, we spoke briefly about the
coming change to a 4k sector size for disks on linux. If I
recall correctly, the general feeling was that the impact was
not significant since we already do most file system IO in 4k
page sizes and should be fine as long as we partition drives
correctly and avoid non-4k aligned partitions.


Sorry about jumping right in, but what about an 'old-style'
partition table that relies on 512 as a unit?



I think that the normal case would involve new drives which
would need to be partitioned in 4k aligned partitions.
Shouldn't that work regardless of the unit used in the
partition table?


Assume this partition table on my current HD:

Disk /dev/hdc: 251.0 GB, 251000193024 bytes
255 heads, 63 sectors/track, 30515 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device  Start    End     Blocks  Id  System
/dev/hdc1      1     33     265041  82  Linux swap / Solaris
/dev/hdc2     34  30515  244846665   5  Extended

That is, 255 * 63 * 30515 * 512 == roughly 251 GB.

Now, if this disk was copied byte per byte (/bin/dd) to a
4096-based disk, and Linux would start using a sector size of
4096, then I would suddenly have

255 * 63 * 30515 * 4096 == 2 TB

Although I would not mind the 2 TB, the partition table would
read quite differently (note the Blocks column which is
multiplied by 4 (512x4=4096))


At this level, for RMW drives, nothing changes.  The partition software, 
ATA driver, and all other bits continue to think that sector size == 512 
bytes.


The partition software /hopefully/ becomes smart enough to understand 
the alignment necessary, but that is not a requirement.


This is the key to understanding the difference between a physical 
(==platters) sector size change without a logical (==ATA interface) 
sector size change.




   Device  Start    End     Blocks  Id  System
/dev/hdc1      1     33    1060164  82  Linux swap / Solaris
/dev/hdc2     34  30515  979386660   5  Extended

Which would mean that the swap partition reaches into the real
data partition and would corrupt it.


For RMW drives, RMW cycles would occur but not corruption.

For non-RMW drives, this just wouldn't occur.
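
The arithmetic in the quoted example can be replayed as a small sketch. The geometry constants come from the fdisk output above; the function names are invented for illustration. Note that 4096 / 512 is 8, so a literal reinterpretation scales sizes by 8 rather than the 4 used in the quoted table; the point about RMW drives is unaffected, because there the logical sector size stays at 512 and nothing is reinterpreted.

```python
HEADS = 255
SECTORS_PER_TRACK = 63
CYLINDERS = 30515


def disk_bytes(sector_size: int) -> int:
    """Capacity implied by the CHS geometry for a given sector size in bytes."""
    return HEADS * SECTORS_PER_TRACK * CYLINDERS * sector_size


def reinterpreted_blocks(blocks_1k: int, old_sector: int, new_sector: int) -> int:
    """Size in 1K blocks if the same sector counts were read with a new sector size."""
    return blocks_1k * new_sector // old_sector


if __name__ == "__main__":
    print(disk_bytes(512))    # logical sector size unchanged (RMW drive)
    print(disk_bytes(4096))   # same table misread with 4096-byte logical sectors
    print(reinterpreted_blocks(265041, 512, 4096))
```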

Jeff




Re: impact of 4k sector size on the IO & FS stack

2007-03-12 Thread Jeff Garzik

Jan Engelhardt wrote:

On Mar 11 2007 18:51, Ric Wheeler wrote:

During the recent IO/FS workshop, we spoke briefly about the
coming change to a 4k sector size for disks on linux. If I
recall correctly, the general feeling was that the impact was
not significant since we already do most file system IO in 4k
page sizes and should be fine as long as we partition drives
correctly and avoid non-4k aligned partitions.


Sorry about jumping right in, but what about an 'old-style'
partition table that relies on 512 as a unit?


For 1K/4K physical sector size, where logical sector size remains 512-b, 
nothing changes.  DOS partitions start partitions on odd-numbered 
sectors, so presuming you have odd-aligned disks, life is good.


For 1K/4K logical sector sizes, who knows.  EFI?  

Certainly seems incompatible with the current popular DOS partition format.

Jeff





Re: impact of 4k sector size on the IO & FS stack

2007-03-12 Thread Jeff Garzik

Alan Cox wrote:
First generation of 1K sector drives will continue to use the same 
512-byte ATA sector size you are familiar with.  A single 512-byte write 
will cause the drive to perform a read-modify-write cycle.  This 
configuration is physical 1K sector, logical 512b sector.


The problem case is "read-modify-screwup"

At that point we've trashed the block we were writing (a well studied
recovery case), and we've blasted some previously sane, totally
unrelated sector of data out of existence. That's why we need to know
ideally if they are doing the write to a different physical block when
they do this, so that we don't lose the old data. My guess is they won't
as it'll be hard.


Strict ATA command set answer:  you will have no idea what goes on under 
the hood.  The current 512-b interface stays /exactly/ the same, save 
for a word or two in IDENTIFY DEVICE telling you the "secret" physical 
sector size.  If all your I/Os are aligned properly, then you need not 
worry about RMW cycles, as they will not occur.
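
For illustration, here is a sketch of decoding that "word or two". It assumes the IDENTIFY DEVICE word 106 layout as I understand it from the ATA/ATAPI-7 and later command set documents: bit 15 clear and bit 14 set mark the word as valid, bit 13 flags multiple logical sectors per physical sector, and bits 3:0 hold the power-of-two exponent. Check the spec before relying on this; the function name is hypothetical.

```python
def logical_per_physical(word106: int) -> int:
    """Number of logical sectors per physical sector, per IDENTIFY DEVICE word 106.

    Returns 1 when the word is not valid or the device does not report
    multiple logical sectors per physical sector.
    """
    # The word is only meaningful when bit 15 is 0 and bit 14 is 1.
    if (word106 & 0xC000) != 0x4000:
        return 1
    # Bit 13: device has multiple logical sectors per physical sector.
    if not (word106 & (1 << 13)):
        return 1
    # Bits 3:0: exponent X, meaning 2**X logical sectors per physical sector.
    return 1 << (word106 & 0xF)


if __name__ == "__main__":
    print(logical_per_physical(0x6001))  # 512-b logical on 1K physical
    print(logical_per_physical(0x6003))  # 512-b logical on 4K physical
```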


Intuition answer:  they will use their firmware-internal standard code 
for scheduling reads and writes, and will only reallocate sectors as 
needed by media failure or similar events.


The "M" part of the modify cycle happens in disk ram.  So from the 
disk's point of view, a single 512-b write would require reading a 
single 1K hard sector, updating the contents in cache RAM, and then 
writing a single 1K hard sector.  The reading of the unknown half of the 
sector can be scheduled well in advance, usually, since writeback 
caching gives the drive plenty of time (relatively speaking) to optimize 
things.


Overall, it definitely adds a few more points of failure, but we can't 
do much at all about those points of failure.


In my own experiments on my own Fedora workstation, ~66% of IOs in Linux 
start on an odd sector, and ~33% started on even-numbered sectors.  For 
a 1K-sector drive with 'odd' alignment, the configuration Microsoft will 
likely want, that means the majority of disk transactions will avoid a 
RMW cycle, but a still-numerous minority will not.  I did not test 
transfer length, to see how many transfers /ended/ on an odd sector, 
thus determining how many RMW cycles the tail of an average I/O requires.
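
To make the head/tail accounting concrete, here is a small sketch that counts how many physical sectors a write only partially covers, which is the number of RMW cycles the drive would need under this simple model. The defaults describe the 512-b logical / 1K physical, odd-aligned case discussed above; the function and parameter names are assumptions for illustration, not anything a drive or the kernel exposes.

```python
def rmw_cycles(start_lba: int, count: int,
               logical_per_physical: int = 2, first_aligned_lba: int = 1) -> int:
    """Count physical sectors that a write of `count` logical sectors covers only partially."""
    if count <= 0:
        return 0
    lpp = logical_per_physical
    end = start_lba + count  # first LBA after the write
    # Non-zero when the write starts/ends in the middle of a physical sector.
    head = (start_lba - first_aligned_lba) % lpp
    tail = (end - first_aligned_lba) % lpp
    first_phys = (start_lba - first_aligned_lba) // lpp
    last_phys = (end - 1 - first_aligned_lba) // lpp
    if first_phys == last_phys:
        # The whole write sits inside one physical sector.
        return 1 if (head or tail) else 0
    return (1 if head else 0) + (1 if tail else 0)


if __name__ == "__main__":
    for start, count in [(1, 1), (1, 2), (2, 2), (2, 3), (63, 8)]:
        print(start, count, rmw_cycles(start, count))
```

Feeding a trace of (start, length) pairs through such a function would answer the open question about how often the tail of an I/O needs its own RMW cycle.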




A future configuration will change the logical ATA interface away from 
512-byte sectors to 1K or 4K.  Here, it is impossible to read a quantity 
smaller than 1K or 4K, whatever the sector size is.


That one I'm not worried about - other than "guess how Redmond decide to
make partition tables work" that one is mostly easy (be fun to see how
many controllers simply can't cope with the command formats)


Indeed...

Jeff





Re: impact of 4k sector size on the IO & FS stack

2007-03-11 Thread Jeff Garzik

Alan Cox wrote:

I would be interested to know what the disk vendors intend to use as
their strategy when (with ATA) they have a 512 byte write from an older
file system/setup into a 4K block. The case where errors magically appear


Well, you have logical and physical sector size changes.

First generation of 1K sector drives will continue to use the same 
512-byte ATA sector size you are familiar with.  A single 512-byte write 
will cause the drive to perform a read-modify-write cycle.  This 
configuration is physical 1K sector, logical 512b sector.


A future configuration will change the logical ATA interface away from 
512-byte sectors to 1K or 4K.  Here, it is impossible to read a quantity 
smaller than 1K or 4K, whatever the sector size is.


Jeff




Re: [RFC] Heads up on sys_fallocate()

2007-03-01 Thread Jeff Garzik

Amit K. Arora wrote:

This is to give a heads up on few patches that we will be soon coming up
with. These patches implement a new system call sys_fallocate() and a
new inode operation "fallocate", for persistent preallocation. The new
system call, as Andrew suggested, will look like:

  asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len);

As we are developing and testing the required patches, we decided to
post a preliminary patch and get inputs from the community to give it
a right direction and shape. First, a little description on the feature.
 
Persistent preallocation is a file system feature using which an
application (say, relational database servers) can explicitly
preallocate blocks to a particular file. This feature can be used to
reserve space for a file to get mainly the following benefits:
1> contiguity - less fragmentation and thus faster access speed, and
2> guarantee for a minimum space availability (depending on how many
blocks were preallocated) for the file, even if the filesystem becomes
full.

XFS already has an implementation for this, using an ioctl interface. And,
ext4 is now coming up with this feature. In coming time we may see a few
more file systems implementing this. Thus, it makes sense to have a more
standard interface for this, like this new system call.

Here is the initial and incomplete version of the patch, which can be
used for the discussion, till we come up with a set of more complete
patches.

---
 arch/i386/kernel/syscall_table.S |    1 +
 fs/ext4/file.c                   |    1 +
 fs/open.c                        |   18 ++
 include/asm-i386/unistd.h        |    3 ++-
 include/linux/fs.h               |    1 +
 include/linux/syscalls.h         |    1 +
 6 files changed, 24 insertions(+), 1 deletion(-)


I certainly agree that we want something like this.

posix_fallocate() is the glibc interface we want to be compatible with 
(which your definition is, AFAICS).
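
For reference, the (fd, offset, len) shape can be exercised today through the existing glibc call. The sketch below uses Python's `os.posix_fallocate` wrapper on a POSIX system; the `preallocate` helper is a made-up name, and nothing here uses the proposed sys_fallocate() itself.

```python
import os
import tempfile


def preallocate(fd: int, offset: int, length: int) -> int:
    """Reserve [offset, offset + length) for `fd` and return the resulting file size."""
    os.posix_fallocate(fd, offset, length)
    return os.fstat(fd).st_size


if __name__ == "__main__":
    with tempfile.TemporaryFile() as f:
        # An empty file grows to offset + length when the range lies past EOF.
        print(preallocate(f.fileno(), 4096, 8192))
```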


Jeff





Re: end to end error recovery musings

2007-02-26 Thread Jeff Garzik

Theodore Tso wrote:

Can someone with knowledge of current disk drive behavior confirm that
for all drives that support bad block sparing, if an attempt to write
to a particular spot on disk results in an error due to bad media at
that spot, the disk drive will automatically rewrite the sector to a
sector in its spare pool, and automatically redirect that sector to
the new location.  I believe this should be always true, so presumably
with all modern disk drives a write error should mean something very
serious has happened.



This is what will /probably/ happen.  The drive should indeed find a 
spare sector and remap it, if the write attempt encounters a bad spot on 
the media.


However, with a large enough write, large enough bad-spot-on-media, and 
a firmware programmed to never take more than X seconds to complete 
their enterprise customers' I/O, it might just fail.



IMO, somewhere in the kernel, when we receive a read-op or write-op 
media error, we should immediately try to plaster that area with small 
writes.  Sure, if it's a read-op you lost data, but this method will 
maximize the chance that you can refresh/reuse the logical sectors in 
question.
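
A userspace simulation of that idea, as a sketch only: the failed range is re-issued as small writes, and whatever still fails is reported back. `write_sector` is a hypothetical callback standing in for the real I/O path; what data to write (zeros, or a copy from a redundant source) is left to the caller.

```python
from typing import Callable, List, Tuple


def rewrite_in_small_chunks(start_lba: int, count: int,
                            write_sector: Callable[[int, int], bool],
                            chunk: int = 1) -> List[Tuple[int, int]]:
    """Retry a failed range as small writes; return the (lba, count) pieces that still fail."""
    still_bad = []
    lba = start_lba
    end = start_lba + count
    while lba < end:
        n = min(chunk, end - lba)
        # Each small write gives the drive a separate chance to remap that spot.
        if not write_sector(lba, n):
            still_bad.append((lba, n))
        lba += n
    return still_bad


if __name__ == "__main__":
    bad_spots = {105}  # pretend this sector cannot be remapped

    def fake_write(lba: int, n: int) -> bool:
        return not any(s in bad_spots for s in range(lba, lba + n))

    print(rewrite_in_small_chunks(100, 16, fake_write))
```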


Jeff




[git patch, resend] remove JFFS v1

2007-02-17 Thread Jeff Garzik

[just sent this upstream; obvious file-removal patch snipped for size]

(resend) 

Why: Unmaintained for years, superceded by JFFS2 for years.

Please pull from 'kill-jffs' branch of
master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/misc-2.6.git kill-jffs

to receive the following updates:

 Documentation/feature-removal-schedule.txt |    7 -
 fs/Kconfig                                 |   26 -
 fs/Makefile                                |    1 -
 fs/jffs/Makefile                           |   11 -
 fs/jffs/inode-v23.c                        | 1847 ---
 fs/jffs/intrep.c                           | 3449
 fs/jffs/intrep.h                           |   58 -
 fs/jffs/jffs_fm.c                          |  798 ---
 fs/jffs/jffs_fm.h                          |  149 --
 fs/jffs/jffs_proc.c                        |  261 ---
 fs/jffs/jffs_proc.h                        |   28 -
 include/linux/jffs.h                       |  224 --
 12 files changed, 0 insertions(+), 6859 deletions(-)
 delete mode 100644 fs/jffs/Makefile
 delete mode 100644 fs/jffs/inode-v23.c
 delete mode 100644 fs/jffs/intrep.c
 delete mode 100644 fs/jffs/intrep.h
 delete mode 100644 fs/jffs/jffs_fm.c
 delete mode 100644 fs/jffs/jffs_fm.h
 delete mode 100644 fs/jffs/jffs_proc.c
 delete mode 100644 fs/jffs/jffs_proc.h
 delete mode 100644 include/linux/jffs.h

Jeff Garzik (1):
  Remove JFFS (version 1), as scheduled.

diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt
index c585aa8..e1bc0c5 100644
--- a/Documentation/feature-removal-schedule.txt
+++ b/Documentation/feature-removal-schedule.txt
@@ -306,13 +306,6 @@ Who:   Len Brown <[EMAIL PROTECTED]>
 
 ---
 
-What:  JFFS (version 1)
-When:  2.6.21
-Why:   Unmaintained for years, superceded by JFFS2 for years.
-Who:   Jeff Garzik <[EMAIL PROTECTED]>
-
----
-
 What:   sk98lin network driver
 When:   July 2007
 Why:In kernel tree version of driver is unmaintained. Sk98lin driver
diff --git a/fs/Kconfig b/fs/Kconfig
index a722b5a..3c4886b 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -1189,32 +1189,6 @@ config EFS_FS
  To compile the EFS file system support as a module, choose M here: the
  module will be called efs.
 
-config JFFS_FS
-   tristate "Journalling Flash File System (JFFS) support"
-   depends on MTD && BLOCK && BROKEN
-   help
- JFFS is the Journalling Flash File System developed by Axis
- Communications in Sweden, aimed at providing a crash/powerdown-safe
- file system for disk-less embedded devices. Further information is
- available at (<http://developer.axis.com/software/jffs/>).
-
- NOTE: This filesystem is deprecated and is scheduled for removal in
- 2.6.21.  See Documentation/feature-removal-schedule.txt
-
-config JFFS_FS_VERBOSE
-   int "JFFS debugging verbosity (0 = quiet, 3 = noisy)"
-   depends on JFFS_FS
-   default "0"
-   help
- Determines the verbosity level of the JFFS debugging messages.
-
-config JFFS_PROC_FS
-   bool "JFFS stats available in /proc filesystem"
-   depends on JFFS_FS && PROC_FS
-   help
- Enabling this option will cause statistics from mounted JFFS file systems
- to be made available to the user in the /proc/fs/jffs/ directory.
-
 config JFFS2_FS
tristate "Journalling Flash File System v2 (JFFS2) support"
select CRC32
diff --git a/fs/Makefile b/fs/Makefile
index b9ffa63..9edf411 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -94,7 +94,6 @@ obj-$(CONFIG_HPFS_FS) += hpfs/
 obj-$(CONFIG_NTFS_FS)  += ntfs/
 obj-$(CONFIG_UFS_FS)   += ufs/
 obj-$(CONFIG_EFS_FS)   += efs/
-obj-$(CONFIG_JFFS_FS)  += jffs/
 obj-$(CONFIG_JFFS2_FS) += jffs2/
 obj-$(CONFIG_AFFS_FS)  += affs/
 obj-$(CONFIG_ROMFS_FS) += romfs/
[snip file deletion patch]


[git patch] remove jffs (v1)

2007-02-08 Thread Jeff Garzik

Please pull from 'kill-jffs' branch of
master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/misc-2.6.git kill-jffs

to receive the following updates:

 Documentation/feature-removal-schedule.txt |    7 -
 fs/Kconfig                                 |   26 -
 fs/Makefile                                |    1 -
 fs/jffs/Makefile                           |   11 -
 fs/jffs/inode-v23.c                        | 1847 ---
 fs/jffs/intrep.c                           | 3449
 fs/jffs/intrep.h                           |   58 -
 fs/jffs/jffs_fm.c                          |  798 ---
 fs/jffs/jffs_fm.h                          |  149 --
 fs/jffs/jffs_proc.c                        |  261 ---
 fs/jffs/jffs_proc.h                        |   28 -
 include/linux/jffs.h                       |  224 --
 12 files changed, 0 insertions(+), 6859 deletions(-)
 delete mode 100644 fs/jffs/Makefile
 delete mode 100644 fs/jffs/inode-v23.c
 delete mode 100644 fs/jffs/intrep.c
 delete mode 100644 fs/jffs/intrep.h
 delete mode 100644 fs/jffs/jffs_fm.c
 delete mode 100644 fs/jffs/jffs_fm.h
 delete mode 100644 fs/jffs/jffs_proc.c
 delete mode 100644 fs/jffs/jffs_proc.h
 delete mode 100644 include/linux/jffs.h

Jeff Garzik (1):
  Delete JFFS (version 1), as scheduled.

diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt
index 0ba6af0..fc53239 100644
--- a/Documentation/feature-removal-schedule.txt
+++ b/Documentation/feature-removal-schedule.txt
@@ -318,10 +318,3 @@ Why:   /proc/acpi/button has been replaced by events to the input layer
 Who:   Len Brown <[EMAIL PROTECTED]>
 
 ---
-
-What:  JFFS (version 1)
-When:  2.6.21
-Why:   Unmaintained for years, superceded by JFFS2 for years.
-Who:   Jeff Garzik <[EMAIL PROTECTED]>
-
----
diff --git a/fs/Kconfig b/fs/Kconfig
index 8cd2417..67a50c9 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -1196,32 +1196,6 @@ config EFS_FS
  To compile the EFS file system support as a module, choose M here: the
  module will be called efs.
 
-config JFFS_FS
-   tristate "Journalling Flash File System (JFFS) support"
-   depends on MTD && BLOCK && BROKEN
-   help
- JFFS is the Journalling Flash File System developed by Axis
- Communications in Sweden, aimed at providing a crash/powerdown-safe
- file system for disk-less embedded devices. Further information is
- available at (<http://developer.axis.com/software/jffs/>).
-
- NOTE: This filesystem is deprecated and is scheduled for removal in
- 2.6.21.  See Documentation/feature-removal-schedule.txt
-
-config JFFS_FS_VERBOSE
-   int "JFFS debugging verbosity (0 = quiet, 3 = noisy)"
-   depends on JFFS_FS
-   default "0"
-   help
- Determines the verbosity level of the JFFS debugging messages.
-
-config JFFS_PROC_FS
-   bool "JFFS stats available in /proc filesystem"
-   depends on JFFS_FS && PROC_FS
-   help
- Enabling this option will cause statistics from mounted JFFS file systems
- to be made available to the user in the /proc/fs/jffs/ directory.
-
 config JFFS2_FS
tristate "Journalling Flash File System v2 (JFFS2) support"
select CRC32
diff --git a/fs/Makefile b/fs/Makefile
index b9ffa63..9edf411 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -94,7 +94,6 @@ obj-$(CONFIG_HPFS_FS) += hpfs/
 obj-$(CONFIG_NTFS_FS)  += ntfs/
 obj-$(CONFIG_UFS_FS)   += ufs/
 obj-$(CONFIG_EFS_FS)   += efs/
-obj-$(CONFIG_JFFS_FS)  += jffs/
 obj-$(CONFIG_JFFS2_FS) += jffs2/
 obj-$(CONFIG_AFFS_FS)  += affs/
 obj-$(CONFIG_ROMFS_FS) += romfs/

[snip obvious diff that deletes fs/jffs/* and include/linux/jffs.h]


[git patch] mention JFFS impending death

2007-01-22 Thread Jeff Garzik

JFFS is already marked CONFIG_BROKEN in fs/Kconfig, with a note that
it's going away in 2.6.21, but the corresponding update to
feature-removal-schedule.txt was accidentally omitted.  Fixed.

Please pull from 'kill-jffs-prep' branch of
master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/misc-2.6.git kill-jffs-prep

to receive the following updates:

 Documentation/feature-removal-schedule.txt |7 +++
 1 files changed, 7 insertions(+), 0 deletions(-)

Jeff Garzik (1):
  Note that JFFS (v1) is to be deleted, in feature-removal-schedule.txt

diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt
index fc53239..0ba6af0 100644
--- a/Documentation/feature-removal-schedule.txt
+++ b/Documentation/feature-removal-schedule.txt
@@ -318,3 +318,10 @@ Why:   /proc/acpi/button has been replaced by events to the input layer
 Who:   Len Brown <[EMAIL PROTECTED]>
 
 ---
+
+What:  JFFS (version 1)
+When:  2.6.21
+Why:   Unmaintained for years, superceded by JFFS2 for years.
+Who:   Jeff Garzik <[EMAIL PROTECTED]>
+
+---


Re: [take32 0/10] kevent: Generic event handling mechanism.

2007-01-10 Thread Jeff Garzik

Evgeniy Polyakov wrote:

On Wed, Jan 10, 2007 at 06:11:26AM -0500, Jeff Garzik ([EMAIL PROTECTED]) wrote:

Once the rate of change slows, Andrew should IMO definitely pick this up.


There are _tons_ of ideas to implement with kevent - so if we want, rate
will not slow down. As you can see, from take26 I only send new
features: signals, posix timers, AIO, userspace notifications, various
flags and the like. I test it on my machines (recently one of them died, so
only amd64 right now (running kernel) and i386 compile-only)
and some bug-fixes without any additional feature requests (almost,
Ingo asked for AIO before New Year), but broader testing is welcome
indeed.


If the rate doesn't slow (if only artificially), people are discouraged 
from reviewing, because it becomes a moving target.



If you wanted to make this process automatic, create a git branch that 
Andrew and others can pull.


Exported git tree would be good, but I do not have enough disk space on


Request an account on http://www.foo-projects.org/ which supports git. 
The Intel guys use it to send me e1000/ixgb changes, for example.




web-site, and do you really want to read comments written in bad english
with russian transliterated indecent words?


The only thing exported to -mm is the code changes, as a patch.  git 
merely automates the process, so that Andrew doesn't have to spend time 
[that he doesn't have] tracking a project with a high rate of change.



I like the direction so far, and think it should be in -mm for wider 
testing and review.


It was there, but Andrew dropped it somewhere about take25 :)


Probably because it was a moving target with a high rate of change, 
requiring time that Andrew did not have just to keep in sync and fix 
build conflicts with other -mm patches.


Jeff




Re: [take32 0/10] kevent: Generic event handling mechanism.

2007-01-10 Thread Jeff Garzik

Evgeniy Polyakov wrote:

Generic event handling mechanism.

Kevent is a generic subsystem which allows handling of event notifications.
It supports both level and edge triggered events. It is similar to
poll/epoll in some cases, but it is more scalable, it is faster and
allows working with essentially any kind of event.

Events are provided into kernel through control syscall and can be read
back through ring buffer or using usual syscalls.
Kevent update (i.e. readiness switching) happens directly from internals
of the appropriate state machine of the underlying subsystem (like
network, filesystem, timer or any other).
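
For readers unfamiliar with the level versus edge distinction mentioned above, here is a sketch using the existing epoll interface for comparison. It does not use kevent at all (no kevent API is shown in this thread), it is Linux-only, and the helper name `ready_counts` is invented for the example.

```python
import os
import select


def ready_counts(edge_triggered: bool):
    """Write one byte to a pipe, then poll twice without reading; return both event counts."""
    r, w = os.pipe()
    ep = select.epoll()
    try:
        flags = select.EPOLLIN | (select.EPOLLET if edge_triggered else 0)
        ep.register(r, flags)
        os.write(w, b"x")
        # Level-triggered: reported as long as data is readable.
        # Edge-triggered: reported once per readiness change.
        first = len(ep.poll(0))
        second = len(ep.poll(0))
        return first, second
    finally:
        ep.close()
        os.close(r)
        os.close(w)


if __name__ == "__main__":
    print("level:", ready_counts(False))
    print("edge: ", ready_counts(True))
```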

Homepage:
http://tservice.net.ru/~s0mbre/old/?section=projects&item=kevent

Documentation page:
http://linux-net.osdl.org/index.php/Kevent

Consider for inclusion.

With this release I start 3 days resending timeout - i.e. each third day 
I will send either new version (if something new was requested and agreed 
to be implemented) or resending with back counter started from three. When 
back counter hits zero after three resendings I consider there is no interest 
in subsystem and I will stop further sending. 

I really doubt it is a good way to tell the world about my work, and I bet you 
are all tired of those pathos words, but I really would like to get some feedback,
since I want to start to work on network AIO, but sending mails into 
unfeedbackable 'destination' really does not motivate me for that.


Thanks for understanding and your time.


Once the rate of change slows, Andrew should IMO definitely pick this up.

If you wanted to make this process automatic, create a git branch that 
Andrew and others can pull.


I like the direction so far, and think it should be in -mm for wider 
testing and review.


Jeff





Re: [PATCH/RFC] Delete JFFS (version 1)

2006-12-12 Thread Jeff Garzik

Bill Nottingham wrote:
Jeff Garzik ([EMAIL PROTECTED]) said: 
It's always been the case that we remove Linux kernel code when the 
number of users (and more importantly, developers) drops to near-nil.


So, drivers/net/3c501.c?


Depends on how motivated Alan remains ;-)  Historically, if the 
developer is active, we have occasionally ignored the minuscule userbase.


Jeff





Re: [PATCH/RFC] Delete JFFS (version 1)

2006-12-12 Thread Jeff Garzik

Jeff Garzik wrote:
When it's more likely to get struck by lightning than encounter 
filesystem X on a random hard drive in the field, filesystem X need not 
be in the kernel.


As people are already poking me :)  I of course meant "flash device" not 
"hard drive".


SATA maintainer's curse, I suppose, to think of all storage devices as 
hard drives, no matter how incorrect that might be :)


Jeff




Re: [PATCH/RFC] Delete JFFS (version 1)

2006-12-12 Thread Jeff Garzik

Josh Boyer wrote:

On 12/12/06, Jeff Garzik <[EMAIL PROTECTED]> wrote:

I have created the 'kill-jffs' branch of
git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/misc-2.6.git that
removes fs/jffs.

I argue that you can count the users (who aren't on 2.4) on one hand,
and developers don't seem to have cared for it in ages.

People are already talking about jffs2 replacements, so I propose we zap
jffs in 2.6.21.


I'm usually all for killing broken code, but JFFS isn't really broken
is it?  Is there some burden it's causing by being in the kernel at
the moment?


It's always been the case that we remove Linux kernel code when the 
number of users (and more importantly, developers) drops to near-nil.


Every line of code is one more place you have to audit when code 
changes, one more place to update each time the VFS API is touched.


When it's more likely to get struck by lightning than encounter 
filesystem X on a random hard drive in the field, filesystem X need not 
be in the kernel.


IMO, of course :)

Jeff





[PATCH/RFC] Delete JFFS (version 1)

2006-12-12 Thread Jeff Garzik
I have created the 'kill-jffs' branch of 
git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/misc-2.6.git that 
removes fs/jffs.


I argue that you can count the users (who aren't on 2.4) on one hand, 
and developers don't seem to have cared for it in ages.


People are already talking about jffs2 replacements, so I propose we zap 
jffs in 2.6.21.


Jeff



diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt
index 46f2a55..c008303 100644
--- a/Documentation/feature-removal-schedule.txt
+++ b/Documentation/feature-removal-schedule.txt
@@ -270,3 +270,10 @@ Why:   The new layer 3 independant connection tracking replaces the old
 Who:   Patrick McHardy <[EMAIL PROTECTED]>
 
 ---
+
+What:  JFFS (version 1) filesystem
+When:  2.6.21
+Why:   No users or developers
+Who:   Jeff Garzik <[EMAIL PROTECTED]>
+
+---


Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device arguments from lookup)

2001-05-19 Thread Jeff Garzik

Here's a dumb question, and I apologize if I am questioning computer
science dogma...

Why are LVM and EVMS (a competing LVM project) needed at all?

Surely the same can be accomplished with
* md
* snapshot blkdev (attached in previous e-mail)
* giving partitions and blkdevs the ability to grow and shrink
* giving filesystems the ability to grow and shrink

On-line optimization (defrag, etc) shouldn't be hard once you have the
ability to move blocks and files around, which would come with the
ability to grow and shrink blkdevs and fs's.

-- 
Jeff Garzik  | "Do you have to make light of everything?!"
Building 1024| "I'm extremely serious about nailing your
MandrakeSoft |  step-daughter, but other than that, yes."



Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device arguments from lookup)

2001-05-19 Thread Jeff Garzik

Linus Torvalds wrote:
> There are some strong arguments that we should have filesystem
> "backdoors" for maintenance purposes, including backup.

I think I agree with something Al said over IRC, that fs-level snapshots
are preferred over block level snapshots.

fs-level snapshots should become easy if you have a generic transaction
layer.  The OS spits out file ops, which get processed into a set of fs
transactions.  (remember that fs-level stuff like "change this block
bitmap" is also a transaction, just like the more generic "update this
inode's mtime")

Also, I think there should be generic block allocation strategies that
fs's can use.  Implementing fs-specific strategies such as ext2's
readahead or XFS's delayed allocation is not a solution, IMHO, but
working towards solving the real problem.



> You can, of course, do parts of this on a LVM level, and doing backups
> with "disk snapshots" may be a valid approach. However, even that is
> debatable: there is very little that says that the disk image has to be
> up-to-date at any particular point in time, so even with a disk snapshot
> capability (which is not necessarily reasonable under all circumstances)
> there are arguments for maintenance interfaces.

I've been hacking on the attached, a snapshot block device driver, which
doesn't require LVM at all.  (warning: compiled and updated per outside
review, but very alpha...  do not apply)

The point of the driver is to provide a sync point at snapshot time, at
which all metadata and data is flushed to the block device.

My question... is there a fundamental flaw in this plan?  Ideally when
userspace says "start snapshot", the fsync_dev occurs [a
simplification].  At that point, userspace can safely run dump or tar or
whatever on the virtual snapshot device.
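
To make the snapshot idea concrete, here is a minimal userspace sketch, assuming a copy-on-write design that this email does not spell out. Every name in it (snap_model, snap_start, src_write, snap_read, NBLOCKS, BLKSZ) is hypothetical and is not taken from the attached driver. After the snapshot starts, the first write to a block saves the old contents to a log, so reads through the snapshot keep seeing the data as it was at the sync point.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NBLOCKS 4   /* hypothetical tiny device: four blocks */
#define BLKSZ   8   /* hypothetical block size in bytes */

/* Hypothetical in-memory model: a live source device plus a log that
 * keeps the pre-snapshot contents of any block overwritten later. */
struct snap_model {
    char src[NBLOCKS][BLKSZ];   /* live "source" device */
    char log[NBLOCKS][BLKSZ];   /* saved old blocks (copy-on-write log) */
    bool logged[NBLOCKS];       /* true once a block was copied to the log */
    bool active;                /* true after the snapshot has started */
};

/* Start a snapshot.  In a real driver this is where dirty data would be
 * flushed first (the email mentions fsync_dev); this model has no cache,
 * so there is nothing to flush. */
static void snap_start(struct snap_model *m)
{
    memset(m->logged, 0, sizeof(m->logged));
    m->active = true;
}

/* Write one full block (BLKSZ bytes) to the source.  If a snapshot is
 * active, preserve the old contents the first time the block changes. */
static void src_write(struct snap_model *m, int blk, const char *data)
{
    assert(blk >= 0 && blk < NBLOCKS);
    if (m->active && !m->logged[blk]) {
        memcpy(m->log[blk], m->src[blk], BLKSZ);
        m->logged[blk] = true;
    }
    memcpy(m->src[blk], data, BLKSZ);
}

/* Read through the snapshot: changed blocks come from the log,
 * untouched blocks come straight from the source. */
static const char *snap_read(const struct snap_model *m, int blk)
{
    assert(blk >= 0 && blk < NBLOCKS);
    return m->logged[blk] ? m->log[blk] : m->src[blk];
}
```

With this model, a backup tool reading via snap_read() sees a stable image while writers continue to modify the source through src_write().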

-- 
Jeff Garzik  | "Do you have to make light of everything?!"
Building 1024| "I'm extremely serious about nailing your
MandrakeSoft |  step-daughter, but other than that, yes."

Index: linux_2_4/drivers/block/Config.in
diff -u linux_2_4/drivers/block/Config.in:1.1.1.44 linux_2_4/drivers/block/Config.in:1.1.1.44.4.1
--- linux_2_4/drivers/block/Config.in:1.1.1.44  Tue May 15 04:43:24 2001
+++ linux_2_4/drivers/block/Config.in   Wed May 16 15:44:59 2001
@@ -46,4 +46,6 @@
 fi
 dep_bool '  Initial RAM disk (initrd) support' CONFIG_BLK_DEV_INITRD $CONFIG_BLK_DEV_RAM
 
+tristate 'Snapshot device support' CONFIG_BLK_DEV_SNAP
+
 endmenu
Index: linux_2_4/drivers/block/Makefile
diff -u linux_2_4/drivers/block/Makefile:1.1.1.46 linux_2_4/drivers/block/Makefile:1.1.1.46.4.1
--- linux_2_4/drivers/block/Makefile:1.1.1.46   Tue May 15 04:43:24 2001
+++ linux_2_4/drivers/block/Makefile   Wed May 16 15:44:59 2001
@@ -31,6 +31,7 @@
 obj-$(CONFIG_BLK_DEV_DAC960)   += DAC960.o
 
 obj-$(CONFIG_BLK_DEV_NBD)  += nbd.o
+obj-$(CONFIG_BLK_DEV_SNAP) += snap.o
 
 subdir-$(CONFIG_PARIDE) += paride
 
Index: linux_2_4/drivers/block/snap.c
diff -u /dev/null linux_2_4/drivers/block/snap.c:1.1.6.10
--- /dev/null   Sat May 19 17:36:30 2001
+++ linux_2_4/drivers/block/snap.c  Thu May 17 11:48:54 2001
@@ -0,0 +1,1055 @@
+/*
+   Copyright 2001 Jeff Garzik <[EMAIL PROTECTED]>
+   Copyright (C) 2000 Jens Axboe <[EMAIL PROTECTED]>
+  
+   May be copied or modified under the terms of the GNU General Public
+   License.  See linux/COPYING for more information.
+  
+   Several ideas and some code taken from Jens Axboe's pktcdvd.c 0.0.2j.
+  
+   To-Do list:
+   * Write support.  It's easy, and might be useful in isolated circumstances.
+   * Convert MAX_SNAPDEVS to a module parameter.
+   * Wrap use of "%" operator, to prepare for 64-bit-sized blockdevs on 
+ 32-bit processors
+  
+ */
+
+#define VERSION_CODE   "v0.5.0-take6  17 May 2001  Jeff Garzik <[EMAIL PROTECTED]>"
+#define MODNAME        "snap"
+#define PFX            MODNAME ": "
+#define MAX_SNAPDEVS   16
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+static int *snap_sizes;
+static int *snap_blksize;
+static int *snap_readahead;
+static struct snap_device *snap_devs;
+static int snap_major = -1;
+static spinlock_t snap_lock = SPIN_LOCK_UNLOCKED;
+
+
+/*
+ * a bit of a kludge, but we want to be able to pass source, log,
+ * or snap dev and get the right one.
+ */
+static struct snap_device *snap_find_dev(kdev_t dev)
+{
+   int i, j;
+   struct snap_device *sd;
+
+   spin_lock(&snap_lock);
+
+   for (i = 0; i < MAX_SNAPDEVS; i++) {
+       sd = &snap_devs[i];
+       if ((sd->src.dev == dev) || (sd->snap_dev == dev))
+           goto out;
+       for (j = 0; j < sd->n_logs; j++)
+   if (sd-&g

Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device arguments from lookup)

2001-05-19 Thread Jeff Garzik

Jeff Garzik wrote:
> Notice also a "metadata miscdev" solves the problem of passing options
> on open -- just pass those options to the miscdev before you open it...

to be more clear, "it" == the data device, not the metadata miscdev

-- 
Jeff Garzik  | "Do you have to make light of everything?!"
Building 1024| "I'm extremely serious about nailing your
MandrakeSoft |  step-daughter, but other than that, yes."



Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device arguments from lookup)

2001-05-19 Thread Jeff Garzik

Are we talking about device arguments just for chrdevs and blkdevs? 
(ie. drivers)  or for regular files too?

Speaking about drivers specifically, a controlling miscdev, one per
device or one per group of devices depending on your needs, is a much
more clean solution for passing ioctl-type data.  You are free to come
up with whatever method of communication with the driver is most
efficient for your needs -- without perverting open(2).

Notice also a "metadata miscdev" solves the problem of passing options
on open -- just pass those options to the miscdev before you open it...

metadata miscdevs are a clean solution to what procfs hacks and ioctls
are trying to accomplish.

Jeff


-- 
Jeff Garzik  | "Do you have to make light of everything?!"
Building 1024| "I'm extremely serious about nailing your
MandrakeSoft |  step-daughter, but other than that, yes."



Re: ext3 for 2.4

2001-05-17 Thread Jeff Garzik

AFAIK the original stated intention of ext3 was

cd linux/fs
cp -a ext2 ext3
# hack on ext3

That leaves ext2 in ultra-stability,
no-patches-unless-absolutely-necessary mode.

IMHO prove a new feature, like directories in page cache, journaling,
etc. in ext3 first.  Then maybe after a year of testing, if people
actually care, backport those features to ext2.

-- 
Jeff Garzik  | Game called on account of naked chick
Building 1024|
MandrakeSoft |



PATCH 2.4.0.11.1: ramfs fix for highmem

2000-11-08 Thread Jeff Garzik

ramfs calls memset(page_address(page),...) on a page which might be in
highmem.

This had been mentioned before on lkml, I noticed, but it never made it
into the kernel.  I noticed and changed the same thing when I was
hacking on tmpfs, so might as well make sure this gets into the kernel.

There is also another patch on lkml for ramfs, one which adds resource
limits.  Ug, it can be done so much better with mount options. 
Anyway...  I'm straying off topic.

-- 
Jeff Garzik | "When I do this, my computer freezes."
Building 1024   |  -user
MandrakeSoft| "Don't do that."
|  -level 1

Index: fs/ramfs/inode.c
===
RCS file: /cvsroot/gkernel/linux_2_4/fs/ramfs/inode.c,v
retrieving revision 1.1.1.5
diff -u -r1.1.1.5 inode.c
--- fs/ramfs/inode.c	2000/10/22 21:52:44	1.1.1.5
+++ fs/ramfs/inode.c	2000/11/08 17:28:33
@@ -65,7 +65,8 @@
 static int ramfs_readpage(struct file *file, struct page * page)
 {
 	if (!Page_Uptodate(page)) {
-		memset(page_address(page), 0, PAGE_CACHE_SIZE);
+		memset(kmap(page), 0, PAGE_CACHE_SIZE);
+		kunmap(page);
 		flush_dcache_page(page);
 		SetPageUptodate(page);
 	}



tmpfs update...

2000-11-08 Thread Jeff Garzik

Attached is another shot at tmpfs.  I use my own vm_ops, where the only
member initialized is nopage (==filemap_nopage).  In particular,
swapout==NULL, so that try_to_swap_out will swap out pages for us.

Of course, it's still broken, with pretty much the same behavior as
before -- things don't seem to be getting swapped out correctly, so once
physical RAM is exhausted, things break.

Note for reading -- the code is now pretty much the same as ramfs again,
with the exception that we use our own mmap function to hook in the
custom vm_ops.
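
The "leave a member NULL so the generic path runs" idea can be sketched in plain userspace C. This is only an illustration under stated assumptions: vm_ops_model, page_model, generic_nopage, generic_swapout and do_swapout are hypothetical stand-ins, not kernel interfaces, and the sketch uses C99 designated initializers rather than the older GNU "member:" syntax seen in the attached source.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a page. */
struct page_model {
    int id;
};

/* Hypothetical stand-in for an operations table such as vm_ops. */
struct vm_ops_model {
    int (*nopage)(struct page_model *pg);
    int (*swapout)(struct page_model *pg);
};

/* Generic handlers, standing in for the kernel's default code paths. */
static int generic_nopage(struct page_model *pg)
{
    return pg->id;
}

static int generic_swapout(struct page_model *pg)
{
    return 1000 + pg->id;
}

/* Only nopage is initialized; every member not named here, including
 * swapout, is implicitly NULL. */
static const struct vm_ops_model tmpfs_like_ops = {
    .nopage = generic_nopage,
};

/* Caller-side dispatch: use the custom hook when one is set,
 * otherwise fall back to the generic handler. */
static int do_swapout(const struct vm_ops_model *ops, struct page_model *pg)
{
    assert(pg != NULL);
    if (ops && ops->swapout)
        return ops->swapout(pg);
    return generic_swapout(pg);
}
```

The design point is that the table owner opts into default behavior simply by not naming a member, so the dispatch code carries the policy and the filesystem stays small.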

Comments appreciated,

Jeff


-- 
Jeff Garzik | "When I do this, my computer freezes."
Building 1024   |  -user
MandrakeSoft| "Don't do that."
|  -level 1

/*
 * Resizable simple ram filesystem for Linux.
 * Hacked into tmpfs by Jeff Garzik
 *
 * Copyright (C) 2000 Linus Torvalds.
 *   2000 Transmeta Corp.
 *
 * ramfs->tmpfs hacks by Jeff Garzik <[EMAIL PROTECTED]>
 *
 * This file is released under the GPL.
 */

#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

#include 


/* some random number */
#define TMPFS_MAGIC 0xBEDAC0ED


static struct super_operations tmpfs_ops;
static struct address_space_operations tmpfs_aops;
static struct file_operations tmpfs_dir_operations;
static struct file_operations tmpfs_file_operations;
static struct inode_operations tmpfs_dir_inode_operations;


static int tmpfs_statfs(struct super_block *sb, struct statfs *buf)
{
	buf->f_type = TMPFS_MAGIC;
	buf->f_bsize = PAGE_CACHE_SIZE;
	buf->f_namelen = 255;
	return 0;
}

/*
 * Lookup the data. This is trivial - if the dentry didn't already
 * exist, we know it is negative.
 */
static struct dentry * tmpfs_lookup(struct inode *dir, struct dentry *dentry)
{
	d_add(dentry, NULL);
	return NULL;
}

/*
 * Read a page. Again trivial. If it didn't already exist
 * in the page cache, it is zero-filled.
 */
static int tmpfs_readpage(struct file *file, struct page * page)
{
	if (!PageActive(page))
		BUG();
	if (!Page_Uptodate(page)) {
		void *addr = (void*) kmap(page);
		memset(addr, 0, PAGE_CACHE_SIZE);
		kunmap(page);
		flush_dcache_page(page);
		SetPageUptodate(page);
	}
	SetPageDirty(page);
	UnlockPage(page);
	return 0;
}

static int tmpfs_prepare_write(struct file *file, struct page *page, unsigned offset, unsigned to)
{
	void *addr;

	addr = (void *) kmap(page);
	if (!Page_Uptodate(page)) {
		memset(addr, 0, PAGE_CACHE_SIZE);
		flush_dcache_page(page);
		SetPageUptodate(page);
	}
	SetPageDirty(page);
	return 0;
}

static int tmpfs_commit_write(struct file *file, struct page *page, unsigned offset, unsigned to)
{
	struct inode *inode = (struct inode*)page->mapping->host;
	loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;

	kunmap(page);
	if (pos > inode->i_size)
		inode->i_size = pos;
	return 0;
}

static struct vm_operations_struct tmpfs_mmap_ops = {
	nopage: filemap_nopage,
};

/* This is used for a general mmap of a disk file */

static int tmpfs_file_mmap(struct file * file, struct vm_area_struct * vma)
{
	struct vm_operations_struct * ops;
	struct inode *inode = file->f_dentry->d_inode;

	ops = &tmpfs_mmap_ops;
	if (!inode->i_sb || !S_ISREG(inode->i_mode))
		return -EACCES;
	if (!inode->i_mapping->a_ops->readpage)
		return -ENOEXEC;
	UPDATE_ATIME(inode);
	vma->vm_ops = ops;
	return 0;
}

static struct inode *tmpfs_get_inode(struct super_block *sb, int mode, int dev)
{
	struct inode * inode = get_empty_inode();

	if (inode) {
		inode->i_sb = sb;
		inode->i_dev = sb->s_dev;
		inode->i_mode = mode;
		inode->i_uid = current->fsuid;
		inode->i_gid = current->fsgid;
		inode->i_size = 0;
		inode->i_blksize = PAGE_CACHE_SIZE;
		inode->i_blocks = 0;
		inode->i_rdev = to_kdev_t(dev);
		inode->i_nlink = 1;
		inode->i_op = NULL;
		inode->i_fop = NULL;
		inode->i_mapping->a_ops = &tmpfs_aops;
		inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
		inode->u.generic_ip = NULL;
		switch (mode & S_IFMT) {
		default:
			init_special_inode(inode, mode, dev);
			break;
   

PATCH: tmpfs

2000-11-07 Thread Jeff Garzik

Here's a quick one-night hack of ramfs to make it swap... ie. tmpfs.  If
some of the VM gurus could look over it, that would be great.  It works
great until physical RAM is exhausted, then... infinite swap :)

My current approach is to swap out pages "manually" in
address_space::writepage, and read them back in when ::readpage is
called.  A red-black tree of swapped-out pages is maintained for each
inode in RAM.  metadata is never swapped out, only data.
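
The per-inode lookup structure described above can be sketched in userspace C. To keep it short, this sketch swaps in a plain unbalanced binary search tree instead of a red-black tree; the ordering by page address and the "find and unlink in one step" lookup follow the description, while the names (swap_ent, ent_insert, ent_take, slot) are hypothetical and not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* One swapped-out page: keyed by the page's address, carrying a
 * hypothetical swap slot number as the payload. */
struct swap_ent {
    struct swap_ent *left, *right;
    const void *page;
    unsigned long slot;
};

/* Insert a new entry.  Returns 0 on success, -1 if the key already
 * exists or memory allocation fails. */
static int ent_insert(struct swap_ent **root, const void *page,
                      unsigned long slot)
{
    struct swap_ent **p = root;

    while (*p) {
        if ((uintptr_t)page < (uintptr_t)(*p)->page)
            p = &(*p)->left;
        else if ((uintptr_t)page > (uintptr_t)(*p)->page)
            p = &(*p)->right;
        else
            return -1;
    }
    *p = calloc(1, sizeof(**p));
    if (!*p)
        return -1;
    (*p)->page = page;
    (*p)->slot = slot;
    return 0;
}

/* Find the entry for a page and unlink it from the tree in one step.
 * Returns NULL if absent; the caller frees the returned entry. */
static struct swap_ent *ent_take(struct swap_ent **root, const void *page)
{
    struct swap_ent **p = root;
    struct swap_ent *hit;

    while (*p && (*p)->page != page)
        p = ((uintptr_t)page < (uintptr_t)(*p)->page)
            ? &(*p)->left : &(*p)->right;
    hit = *p;
    if (!hit)
        return NULL;

    if (!hit->left) {
        *p = hit->right;
    } else if (!hit->right) {
        *p = hit->left;
    } else {
        /* Two children: replace the hit with its in-order successor. */
        struct swap_ent **s = &hit->right;
        struct swap_ent *succ;

        while ((*s)->left)
            s = &(*s)->left;
        succ = *s;
        *s = succ->right;
        succ->left = hit->left;
        succ->right = hit->right;
        *p = succ;
    }
    hit->left = hit->right = NULL;
    return hit;
}
```

A balanced tree such as the red-black tree bounds the lookup cost at O(log n) per inode, which an unbalanced tree like this one does not guarantee; the interface would look the same either way.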

Alternative approach:  using a custom vm_operations, set swapout==NULL. 
This forces try_to_swap_out() to swap the page out.  ::writepage becomes
very simple then, but ::readpage becomes more complex.

Comments welcome...   I think something -simple- like this can be used
to create tmpfs.  I looked at the "shmfs" code, and it was huge compared
to this...

Jeff, the VM newbie


-- 
Jeff Garzik | "When I do this, my computer freezes."
Building 1024   |  -user
MandrakeSoft| "Don't do that."
|  -level 1

Index: linux_2_4/fs/Config.in
diff -u linux_2_4/fs/Config.in:1.1.1.5 linux_2_4/fs/Config.in:1.1.1.5.18.3
--- linux_2_4/fs/Config.in:1.1.1.5  Sun Oct 22 14:51:49 2000
+++ linux_2_4/fs/Config.in  Mon Nov  6 23:39:52 2000
@@ -30,6 +30,7 @@
 fi
 tristate 'Compressed ROM file system support' CONFIG_CRAMFS
 tristate 'Simple RAM-based file system support' CONFIG_RAMFS
+dep_bool 'Simple VM-backed, RAM-based file system support' CONFIG_TMPFS $CONFIG_EXPERIMENTAL
 
 tristate 'ISO 9660 CDROM file system support' CONFIG_ISO9660_FS
 dep_mbool '  Microsoft Joliet CDROM extensions' CONFIG_JOLIET $CONFIG_ISO9660_FS
Index: linux_2_4/fs/Makefile
diff -u linux_2_4/fs/Makefile:1.1.1.5 linux_2_4/fs/Makefile:1.1.1.5.18.2
--- linux_2_4/fs/Makefile:1.1.1.5   Sun Oct 22 14:51:44 2000
+++ linux_2_4/fs/Makefile   Mon Nov  6 20:25:40 2000
@@ -29,6 +29,7 @@
 subdir-$(CONFIG_EXT2_FS)   += ext2
 subdir-$(CONFIG_CRAMFS)    += cramfs
 subdir-$(CONFIG_RAMFS)     += ramfs
+subdir-$(CONFIG_TMPFS)     += tmpfs
 subdir-$(CONFIG_CODA_FS)   += coda
 subdir-$(CONFIG_MINIX_FS)  += minix
 subdir-$(CONFIG_FAT_FS)    += fat
Index: linux_2_4/fs/tmpfs/Makefile
diff -u /dev/null linux_2_4/fs/tmpfs/Makefile:1.1.2.1
--- /dev/null   Tue Nov  7 00:36:29 2000
+++ linux_2_4/fs/tmpfs/Makefile Mon Nov  6 20:25:40 2000
@@ -0,0 +1,11 @@
+#
+# Makefile for the linux tmpfs routines.
+#
+
+O_TARGET := tmpfs.o
+
+O_OBJS := inode.o
+
+M_OBJS := $(O_TARGET)
+
+include $(TOPDIR)/Rules.make
Index: linux_2_4/fs/tmpfs/inode.c
diff -u /dev/null linux_2_4/fs/tmpfs/inode.c:1.1.2.4
--- /dev/null   Tue Nov  7 00:36:29 2000
+++ linux_2_4/fs/tmpfs/inode.c  Tue Nov  7 00:25:07 2000
@@ -0,0 +1,552 @@
+/*
+ * Resizable simple ram filesystem for Linux.
+ * Hacked into tmpfs by Jeff Garzik
+ *
+ * Copyright (C) 2000 Linus Torvalds.
+ *   2000 Transmeta Corp.
+ *
+ * ramfs->tmpfs hacks by Jeff Garzik <[EMAIL PROTECTED]>
+ *
+ * This file is released under the GPL.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+#include "rbtree.h"
+#include "rbtree.c"
+
+/* some random number */
+#define TMPFS_MAGIC    0xBEDAC0ED
+
+#define tmpfs_ent_g(n) list_entry(n, struct tmpfs_swap_ent, node)
+#define tmpfs_for_each_ent(ent) \
+   for(ent = tmpfs_ent_g(ti->swap_entries.next); \
+   ent != tmpfs_ent_g(&ti->swap_entries); \
+   ent = tmpfs_ent_g(ent->node.next))
+
+static struct super_operations tmpfs_ops;
+static struct address_space_operations tmpfs_aops;
+static struct file_operations tmpfs_dir_operations;
+static struct file_operations tmpfs_file_operations;
+static struct inode_operations tmpfs_dir_inode_operations;
+static kmem_cache_t *swap_ent_cache;
+
+
+struct tmpfs_swap_ent {
+   rb_node_t node;
+   swp_entry_t ent;
+   struct page *page;
+};
+
+
+static inline struct tmpfs_swap_ent *
+rb_search_page_cache(struct inode *inode, struct page *page)
+{
+   rb_node_t * n = inode->u.generic_ip;
+   struct tmpfs_swap_ent *ent;
+
+   while (n)
+   {
+       ent = rb_entry(n, struct tmpfs_swap_ent, node);
+
+       if (((unsigned long)page) < ((unsigned long)ent->page))
+           n = n->rb_left;
+       else if (((unsigned long)page) > ((unsigned long)ent->page))
+           n = n->rb_right;
+       else {
+           rb_erase(n, (rb_root_t*) &inode->u.generic_ip);
+           return ent;
+       }
+   }
+   return NULL;
+}
+
+
+static inline struct tmpfs_swap_ent *
+__rb_insert_page_cache(struct inode *inode, struct page *page, rb_node_t *node)
+{
+   rb_node_t ** p = (rb_node_t **) &inode->u.gen