Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-25 Thread David Lang

On Wed, 25 Apr 2007, Nikita Danilov wrote:


David Lang writes:
> On Tue, 24 Apr 2007, Nikita Danilov wrote:
>
> > David Lang writes:
> > > On Tue, 24 Apr 2007, Nikita Danilov wrote:
> > >
> > > > Amit Gud writes:
> > > >
> > > > Hello,
> > > >
> > > > >
> > > > > This is an initial implementation of ChunkFS technique, briefly discussed
> > > > > at: http://lwn.net/Articles/190222 and
> > > > > http://cis.ksu.edu/~gud/docs/chunkfs-hotdep-val-arjan-gud-zach.pdf
> > > >
> > > > I have a couple of questions about chunkfs repair process.
> > > >
> > > > First, as I understand it, each continuation inode is a sparse file,
> > > > mapping some subset of logical file blocks into block numbers. Then it
> > > > seems that during the "final phase" fsck has to check that these partial
> > > > mappings are consistent, for example, that no two different continuation
> > > > inodes for a given file contain a block number for the same offset. This
> > > > check requires a scan of all chunks (rather than of only those "active during
> > > > the crash"), which seems to bring us back to the scalability problem
> > > > chunkfs tries to address.
> > >
> > > not quite.
> > >
> > > this checking is an O(n^2) or worse problem, and it can eat a lot of memory in
> > > the process. with chunkfs you divide the problem by a large constant (100 or
> > > more) for the checks of individual chunks. after those are done, the final
> > > pass checking the cross-chunk links doesn't have to keep track of everything; it
> > > only needs to check those links and what they point to.
> >
> > Maybe I failed to describe the problem precisely.
> >
> > Suppose that all chunks have been checked. After that, for every inode
> > I0 having continuations I1, I2, ... In, one has to check that every
> > logical block is present in at most one of these inodes. For this one
> > has to read I0, with all its indirect (double-indirect, triple-indirect)
> > blocks, then read I1 with all its indirect blocks, etc., and repeat
> > this for every inode with continuations.
> >
> > In the worst case (every inode has a continuation in every chunk) this
> > obviously is as bad as un-chunked fsck. But even in the average case, the
> > total amount of I/O necessary for this operation is proportional to the
> > _total_ file system size, rather than to the chunk size.
>
> actually, it should be proportional to the number of continuation inodes. The
> expectation (and design) is that they are rare.

Indeed, but total size of meta-data pertaining to all continuation
inodes is still proportional to the total file system size, and so is
fsck time: O(total_file_system_size).
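For concreteness, a minimal sketch of the overlap check being described; all
names and structures here are invented for illustration and are not taken from
the ChunkFS patch:

	/*
	 * Verify that no logical block of a file is mapped by more than
	 * one of its continuation inodes (invented types and names).
	 */
	#include <stdbool.h>
	#include <stdint.h>
	#include <stdlib.h>

	struct cont_inode {
		uint64_t nmapped;	/* logical blocks this continuation maps */
		uint64_t *logical;	/* their logical block numbers */
	};

	/* Returns true iff no logical block is claimed twice. */
	static bool blocks_disjoint(struct cont_inode *ci, int ncont,
				    uint64_t file_blocks)
	{
		/* one bit per logical block of the file: this is the part
		 * whose cost grows with file (and file system) size */
		unsigned char *seen = calloc((file_blocks + 7) / 8, 1);
		int i;
		uint64_t j;

		if (!seen)
			return false;	/* treat allocation failure as failure */

		for (i = 0; i < ncont; i++) {
			for (j = 0; j < ci[i].nmapped; j++) {
				uint64_t b = ci[i].logical[j];

				if (seen[b / 8] & (1u << (b % 8))) {
					free(seen);
					return false;	/* duplicate mapping */
				}
				seen[b / 8] |= 1u << (b % 8);
			}
		}
		free(seen);
		return true;
	}

the bitmap is sized to the whole file and every continuation inode's mappings
get walked, which is exactly why the cost scales with total metadata rather
than with chunk size.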


correct, but remember that in the real world O(total_file_system_size) does not
mean that it can't work well; it just means that larger filesystems will take
longer to check.


they aren't out to eliminate the need for fsck, just to divide the time it
currently takes by a large value, so that as filesystems continue to get
larger it is still reasonable to check them.



What is more important, the design puts (as far as I can see) no upper limit
on the number of continuation inodes, and hence, even if _average_ fsck
time is greatly reduced, occasionally it can take more time than fsck of an
ext2 file system of the same size. This is clearly unacceptable in many
situations (HA, etc.).


in a pathological situation you are correct, it would take longer. however,
before declaring this completely unacceptable, why don't you wait and see
if the pathological situation is at all likely?


when you are doing HA with shared storage you tend to be doing things like
databases, and every database that I know about splits its data files into many
pieces of a fixed size. Postgres, for example, uses 1M files. if you use a chunk
size of 1G, it's very unlikely that more than a couple of files out of every
thousand will end up with continuation inodes.


remember that the current thinking on chunk size is to make each chunk ~1% of
your filesystem, so on a 1TB filesystem your chunk size would be 10G (which, in
the example above, would mean just a couple of files out of every ten thousand
would have continuation inodes).
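to make the back-of-the-envelope explicit (a sketch, assuming files are placed
independently of chunk boundaries, so only boundary-crossing files need a
continuation inode):

    P(file crosses a chunk boundary) ~= file_size / chunk_size
                                      = 1M / 10G ~= 1e-4

i.e. on the order of one file in ten thousand, which is where the figure above
comes from.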


with current filesystems it's _possible_ for a file to be spread out across
the disk such that its first block is at the beginning of the disk, the second
at the end of the disk, the third back at the beginning, the fourth at the end,
etc. but users don't worry about this when using the filesystems, because the
odds of it happening under normal use are vanishingly small (and the
filesystem designers work to keep the odds that small). similarly, the chunkfs
designers are working to make the odds of every file having a continuation inode
vanishingly small as well.


David Lang


Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-24 Thread David Lang

On Tue, 24 Apr 2007, Nikita Danilov wrote:


David Lang writes:
> On Tue, 24 Apr 2007, Nikita Danilov wrote:
>
> > Amit Gud writes:
> >
> > Hello,
> >
> > >
> > > This is an initial implementation of ChunkFS technique, briefly discussed
> > > at: http://lwn.net/Articles/190222 and
> > > http://cis.ksu.edu/~gud/docs/chunkfs-hotdep-val-arjan-gud-zach.pdf
> >
> > I have a couple of questions about chunkfs repair process.
> >
> > First, as I understand it, each continuation inode is a sparse file,
> > mapping some subset of logical file blocks into block numbers. Then it
> > seems that during the "final phase" fsck has to check that these partial
> > mappings are consistent, for example, that no two different continuation
> > inodes for a given file contain a block number for the same offset. This
> > check requires a scan of all chunks (rather than of only those "active during
> > the crash"), which seems to bring us back to the scalability problem
> > chunkfs tries to address.
>
> not quite.
>
> this checking is an O(n^2) or worse problem, and it can eat a lot of memory in
> the process. with chunkfs you divide the problem by a large constant (100 or
> more) for the checks of individual chunks. after those are done, the final
> pass checking the cross-chunk links doesn't have to keep track of everything; it
> only needs to check those links and what they point to.

Maybe I failed to describe the problem precisely.

Suppose that all chunks have been checked. After that, for every inode
I0 having continuations I1, I2, ... In, one has to check that every
logical block is present in at most one of these inodes. For this one
has to read I0, with all its indirect (double-indirect, triple-indirect)
blocks, then read I1 with all its indirect blocks, etc., and repeat
this for every inode with continuations.

In the worst case (every inode has a continuation in every chunk) this
obviously is as bad as un-chunked fsck. But even in the average case, the
total amount of I/O necessary for this operation is proportional to the
_total_ file system size, rather than to the chunk size.


actually, it should be proportional to the number of continuation inodes. The
expectation (and design) is that they are rare.


If you get into the worst-case situation of all of them being continuation
inodes, then you are actually worse off than you were to start with (as you are
saying), but numbers from people's real filesystems (assuming a chunk size equal
to a block-cluster size) indicate that we are more on the order of a fraction
of a percent of the inodes. and the expectation is that since the chunk sizes
will be substantially larger than the block-cluster sizes, this should be
reduced even more.


David Lang


Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-24 Thread David Lang

On Tue, 24 Apr 2007, Nikita Danilov wrote:


Amit Gud writes:

Hello,

>
> This is an initial implementation of ChunkFS technique, briefly discussed
> at: http://lwn.net/Articles/190222 and
> http://cis.ksu.edu/~gud/docs/chunkfs-hotdep-val-arjan-gud-zach.pdf

I have a couple of questions about chunkfs repair process.

First, as I understand it, each continuation inode is a sparse file,
mapping some subset of logical file blocks into block numbers. Then it
seems that during the "final phase" fsck has to check that these partial
mappings are consistent, for example, that no two different continuation
inodes for a given file contain a block number for the same offset. This
check requires a scan of all chunks (rather than of only those "active during
the crash"), which seems to bring us back to the scalability problem
chunkfs tries to address.


not quite.

this checking is an O(n^2) or worse problem, and it can eat a lot of memory in
the process. with chunkfs you divide the problem by a large constant (100 or
more) for the checks of individual chunks. after those are done, the final
pass checking the cross-chunk links doesn't have to keep track of everything; it
only needs to check those links and what they point to.
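to make the "divide by a large constant" arithmetic explicit (a sketch,
treating the expensive part of the check as quadratic in the amount of data
scanned):

    one big check:  work ~ n^2
    k chunks:       k * (n/k)^2 = n^2 / k

so ~100 chunks cut the quadratic part of the work by ~100x, leaving only the
(hopefully small) cross-chunk pass on top.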


any ability to mark a filesystem as 'clean' and then not have to check it on 
reboot is a bonus on top of this.


David Lang


Second, it is not clear how, under the assumption of bugs in the file system
code (an assumption the paper makes at the very beginning), fsck can limit itself
to only the chunks that were active at the moment of the crash.

[...]

>
> Best,
> AG

Nikita.


Re: AppArmor FAQ

2007-04-20 Thread David Lang

On Thu, 19 Apr 2007, Stephen Smalley wrote:


already happened to integrate such support into userland.

To look at it in a slightly different way, the AA emphasis on not
modifying applications could be viewed as a limitation.  Ultimately,
users have security goals that go beyond just what the OS can directly
enforce and at least some applications (notably things like X, D-BUS,
PostgreSQL, etc) need to likewise support strong domain separation and
controlled information flow through their own internal objects and
operations.  SELinux provides APIs and infrastructure for such
applications, and has already done quite a bit of work in that space
(D-BUS support, XACE/XSELinux, SE-PostgreSQL), whereas AA seems to have
no interest in going there (and would have to recant its emphasis on no
application mods to do so).  If you actually want to truly confine a
desktop application, you can't limit yourself to the kernel.  And the

  ^^^


label model provides a unifying abstraction for dealing with all of
these various objects, whereas the path/"natural abstraction" model has
no unifying abstraction at all.



AA isn't aimed at confining desktop applications; it's aimed at confining
server applications. that really is an easier task (and if it happens to be
useful for some desktop apps as well, so much the better).


David Lang


Re: AppArmor FAQ

2007-04-18 Thread David Lang

On Wed, 18 Apr 2007, James Morris wrote:


On Tue, 17 Apr 2007, Alan Cox wrote:


I'm not sure if AppArmor can be made good security for the general case,
but it is a model that works in the limited http environment
(eg .htaccess), it is something people can play with and hack on, and it may
be possible to configure it to be very secure.


Perhaps -- until your httpd is compromised via a buffer overflow or
simply misbehaves due to a software or configuration flaw, then the
assumptions being made about its use of pathnames and their security
properties are out the window.


since AA defines a whitelist of files that httpd is allowed to access, a
compromised httpd may be able to mess up its own files, but it's still not going
to be able to touch other files on the system.
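for illustration, a sketch of what such a whitelist looks like as a profile
fragment (the paths are invented for this example; this is not a complete or
shipped profile):

  /usr/sbin/httpd {
    #include <abstractions/base>

    /etc/httpd/**      r,   # its own config, read-only
    /var/www/**        r,   # document root, read-only
    /var/log/httpd/*   w,   # the only place it may write
  }

anything not listed is simply denied, which is the property being relied on
above.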



Without security labeling of the objects being accessed, you can't protect
against software flaws, which has been a pretty fundamental and widely
understood requirement in general computing for at least a decade.


this is not true. you don't need to label all objects and chunks of memory; you
just need a way to list (and enforce) the objects and memory that the
program is allowed to use. labeling them is one way of doing this, but not the
only way.


David Lang


Re: AppArmor FAQ

2007-04-18 Thread David Lang
remember that the Windows NT permission model is theoretically superior to the
Unix permission model.


however, there are far more insecure Windows boxes out there than Unix boxes
(even taken as a percentage of the installed base).


I don't think that anyone is claiming that AA is superior to SELinux.

what people are claiming is that the model AA proposes can improve security
in some cases.


to use the example from this thread, /etc/resolv.conf:

if you have a webserver that wants to do a name lookup, you can do one of two
things (sketched below)


1. in AA, configure the webserver to have ro access to /etc/resolv.conf

2. in SELinux, tag /etc/resolv.conf, figure out every program on the system that
accesses it, and then tag those programs with the right permissions.
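roughly, and only as a sketch (the AA line is real profile syntax; the SELinux
rule assumes the reference-policy type names httpd_t and net_conf_t):

  # AA: one line inside the webserver's profile
  /etc/resolv.conf r,

  # SELinux: the file carries a label, and every domain that reads it needs
  # a rule like this one (type names assumed, from the reference policy)
  allow httpd_t net_conf_t:file { getattr open read };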


SELinux is designed to be able to make the box safe against root; AA is designed
to let the admin harden exposed apps without having to think about the other
things on the system.


allow people to use each tool for the appropriate task.

David Lang

On Wed, 18 Apr 2007, Rob Meijer wrote:


Date: Wed, 18 Apr 2007 09:21:13 +0200 (CEST)
From: Rob Meijer <[EMAIL PROTECTED]>
To: Karl MacMillan <[EMAIL PROTECTED]>
Cc: James Morris <[EMAIL PROTECTED]>, John Johansen <[EMAIL PROTECTED]>,
[EMAIL PROTECTED], [EMAIL PROTECTED],
linux-fsdevel@vger.kernel.org
Subject: Re: AppArmor FAQ

On Tue, April 17, 2007 23:55, Karl MacMillan wrote:

On Mon, 2007-04-16 at 20:20 -0400, James Morris wrote:

On Mon, 16 Apr 2007, John Johansen wrote:


Label-based security (exemplified by SELinux, and its predecessors in
MLS systems) attaches security policy to the data. As the data flows
through the system, the label sticks to the data, and so security
policy with respect to this data stays intact. This is a good approach
for ensuring secrecy, the kind of problem that intelligence agencies
have.

Labels are also a good approach for ensuring integrity, which is one of
the most fundamental aspects of the security model implemented by
SELinux.

Some may infer otherwise from your document.



Not only that, the implication that secrecy is only useful to
intelligence agencies is pretty funny. Personally, I think that
protecting the confidentiality of my data is important (and that my bank and
health care providers protect the data they have about me). Type
Enforcement was specifically designed to be able to address integrity
_and_ confidentiality in a way acceptable to commercial organizations.

Karl


Shouldn't that be _OR_? As I have always understood it, confidentiality
models like BLP are by their very nature incompatible with integrity
models like Biba. Given this incompatibility, I think the idea that
BLP-style use of labels (the ss-/*-properties and the like) is only useful
to intelligence agencies may well be correct, while usage for integrity
models like Biba and CW would be much broader than that.

A path-based 'least privilege' (POLP) solution would, I think, on its own
address neither integrity nor confidentiality, and on top of that would
prove to be yet another 'fat profile' administration hell.

Having said that, I feel a path-based solution could have great potential
if it could be used in conjunction with the object-capability model, which
I would consider a simple and practical alternative integrity model, one that
does not require labels in an MLS manner and that extends the POLP
concept in a way that would be largely more practical.
That is, using 'thin' path-based profiles would become very practical if
all further authority could be communicated using handles in the same way
that an open file handle can be communicated.

Rob



Re: compressing intermediate files with LZO on the fly

2007-04-07 Thread David Lang

On Sat, 7 Apr 2007, Willy Tarreau wrote:


Hi Al,

On Sat, Apr 07, 2007 at 02:32:34PM +0300, Al Boldi wrote:

Willy Tarreau wrote:


... for some usages (temporary space),
light compression can increase speed. For instance, when processing logs,
I get better speed by compressing intermediate files with LZO on the fly.


How can you do that on ext3?

Also, can you do that at the partition block-io level?


No, sorry for the confusion. My scripts simply do:

$ lzop -cd file1.lzo | process | lzop -c3 > file2.lzo

With decent CPU, you can reach higher read/write data rates than what a
single off-the-shelf disk can achieve. For this reason, I think that
reiser4 would be worth trying for this particular usage. And in this case,
I'm not interested at all in reliability. It's just temporary storage. If
the disk fails, I throw it away and buy a new one.


I see the same thing with my nightly scripts that do syslog analysis. last year
I trimmed 2 hours from the nightly run by processing compressed files instead of
uncompressed ones (after I did this I configured it to compress the files as
they are rolled; with rolling every 5 min the compression takes <20 seconds,
so the compression cost is <30 min).


now I just need to find a version of split that can compress its output files.
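(a sketch of what that could look like, assuming a split(1) with an
output-filter hook; GNU coreutils later grew exactly this as --filter, which
sets $FILE to each output file name -- the producer command here is invented:

  $ some_log_producer | split -b 500M --filter='lzop -c3 > $FILE.lzo' - part-

the single quotes matter, so that $FILE is expanded by the shell split spawns
for each output chunk.)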

David Lang


Re: [ANNOUNCE] RAIF: Redundant Array of Independent Filesystems

2006-12-15 Thread David Lang

On Wed, 13 Dec 2006, Nikolai Joukov wrote:


We have designed a new stackable file system that we called RAIF:
Redundant Array of Independent Filesystems.

Similar to Unionfs, RAIF is a fan-out file system and can be mounted over
many different disk-based, memory, network, and distributed file systems.
RAIF can use the stable and maintained code of the other file systems and
thus stay simple itself.  Similar to standard RAID, RAIF can replicate the
data or store it with parity on any subset of the lower file systems.  RAIF
has three main advantages over traditional driver-level RAID systems:


this sounds very interesting. did you see the paper on chunkfs? 
http://www.usenix.org/events/hotdep06/tech/prelim_papers/henson/henson_html/


it sounds as if you may be able to make a functional equivalent of chunkfs
with your raid0 mode.


David Lang


Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device arguments from lookup)

2001-05-21 Thread David Lang

what makes you think it's safe to say there's only one floppy drive?

David Lang

On Mon, 21 May 2001, Oliver Xymoron wrote:

> On Sat, 19 May 2001, Alexander Viro wrote:
>
> > Let's distinguish between per-fd effects (that's what name in
> > open(name, flags) is for - you are asking for a descriptor and telling
> > what behaviour you want for IO on it) and system-wide side effects.
> >
> > IMO encoding the former into name is perfectly fine, and no write on
> > another file can be sanely used for that purpose. For the latter, though,
> > we need to write commands into files and here your miscdevices (or procfs
> > files, or /dev/foo/ctl - whatever) is needed.
>
> I'm a little skeptical about the necessity of these per-fd effects in the
> first place - after all, Plan 9 does without them.  There's only one
> floppy drive, yes? No concurrent users of serial ports? The counter that
> comes to mind is sound devices supporting multiple opens, but I think
> esound and friends are a better solution to that problem.
>
> What I'd like to see:
>
> - An interface for registering an array of related devices (almost always
> two: raw and ctl) and their legacy device numbers with a single userspace
> callout that does whatever /dev/ creation needs to be done. Thus, naming
> and permissions live in user space. No "device node is also a directory"
> weirdness which is overkill in the vast majority of cases. No kernel names
> or permissions leaking into userspace.
>
> - An unregister_devices that does the same, giving userspace a
> chance to persist permissions, etc.
>
> - A userspace program that keeps a mapping of kernel names to /dev/ names,
> permissions, etc.
>
> - An autofs hook that does the reverse mapping for running with modules
> (possibly calling modprobe directly)
>
> Possible future extension:
>
> - Allow exporting proc as a large collection of devices. Manage /proc in
> userspace on a tmpfs.
>
> --
>  "Love the dolphins," she advised him. "Write by W.A.S.T.E.."
>