Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
On Wed, 25 Apr 2007, Nikita Danilov wrote:

> David Lang writes:
> > On Tue, 24 Apr 2007, Nikita Danilov wrote:
> > > Suppose that all chunks have been checked. After that, for every
> > > inode I0 having continuations I1, I2, ... In, one has to check
> > > that every logical block is present in at most one of these
> > > inodes. [...] even in the average case, the total amount of io
> > > necessary for this operation is proportional to the _total_ file
> > > system size, rather than to the chunk size.
> >
> > actually, it should be proportional to the number of continuation
> > nodes. The expectation (and design) is that they are rare.
>
> Indeed, but the total size of the meta-data pertaining to all
> continuation inodes is still proportional to the total file system
> size, and so is fsck time: O(total_file_system_size).

correct, but remember that in the real world O(total_file_system_size)
does not mean it can't work well. it just means that larger filesystems
will take longer to check. they aren't out to eliminate the need for
fsck, just to divide the time it currently takes by a large value, so
that as filesystems continue to get larger it remains reasonable to
check them.

> What is more important, the design puts (as far as I can see) no
> upper limit on the number of continuation inodes, and hence, even if
> the _average_ fsck time is greatly reduced, occasionally it can take
> more time than ext2 of the same size. This is clearly unacceptable in
> many situations (HA, etc.).

in a pathological situation you are correct, it would take longer.
however, before declaring this completely unacceptable, why don't you
wait and see whether the pathological situation is at all likely?
when you are doing HA with shared storage you tend to be doing things
like databases, and every database that I know about splits its data
files into many pieces of a fixed size. Postgres, for example, uses 1M
files. if you use a chunk size of 1G, it's very unlikely that more than
a couple of files out of every thousand will end up with continuation
nodes. remember that the current thinking on chunk size is to make the
chunks ~1% of your filesystem, so on a 1TB filesystem your chunk size
would be 10G (which, in the example above, would mean just a couple of
files out of every ten thousand having continuation nodes).

with current filesystems it's _possible_ for a file to be spread out
across the disk such that its first block is at the beginning of the
disk, the second at the end of the disk, the third back at the
beginning, the fourth at the end, etc. but users don't worry about this
when using these filesystems because the odds of it happening under
normal use are vanishingly small (and the filesystem designers work to
keep the odds that small). similarly, the chunkfs designers are working
to make the odds of every file having a continuation node vanishingly
small as well.

David Lang
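As a rough sanity check of those odds (back-of-the-envelope arithmetic,
not numbers from the chunkfs authors): a fixed-size file written at an
effectively random position needs a continuation inode only when it
straddles a chunk boundary, which happens with probability of roughly
file_size / chunk_size. For Postgres-style 1M segment files:

    P(continuation) ~= 1MB / 1GB  ~= 0.1%   (about one file per thousand)
    P(continuation) ~= 1MB / 10GB ~= 0.01%  (about one per ten thousand)

which is consistent with the estimates above, assuming the allocator
makes no special effort to avoid boundaries (a real allocator should do
even better).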
Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
On Tue, 24 Apr 2007, Nikita Danilov wrote:

> David Lang writes:
> > On Tue, 24 Apr 2007, Nikita Danilov wrote:
> > > First, as I understand it, each continuation inode is a sparse
> > > file, mapping some subset of logical file blocks into block
> > > numbers. [...]
> >
> > not quite.
> >
> > this checking is an O(n^2) or worse problem, and it can eat a lot
> > of memory in the process. with chunkfs you divide the problem by a
> > large constant (100 or more) for the checks of individual chunks.
> > after those are done, the final pass checking the cross-chunk links
> > doesn't have to keep track of everything, it only needs to check
> > those links and what they point to
>
> Maybe I failed to describe the problem precisely.
>
> Suppose that all chunks have been checked. After that, for every
> inode I0 having continuations I1, I2, ... In, one has to check that
> every logical block is present in at most one of these inodes. For
> this one has to read I0, with all its indirect (double-indirect,
> triple-indirect) blocks, then read I1 with all its indirect blocks,
> etc., and repeat this for every inode with continuations.
>
> In the worst case (every inode has a continuation in every chunk)
> this is obviously as bad as un-chunked fsck. But even in the average
> case, the total amount of io necessary for this operation is
> proportional to the _total_ file system size, rather than to the
> chunk size.

actually, it should be proportional to the number of continuation
nodes. The expectation (and design) is that they are rare.

If you get into the worst-case situation of all of them being
continuation nodes, then you are actually worse off than you were to
start with (as you are saying), but numbers from people's real
filesystems (assuming a chunk size equal to a block-cluster size)
indicate that we are more on the order of a fraction of a percent of
the nodes. and since the chunk sizes will be substantially larger than
the block-cluster sizes, the expectation is that this will be reduced
even more.

David Lang
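To make the disputed final pass concrete, here is a minimal user-space
sketch of the overlap check Nikita describes. All names and data
structures below are invented for illustration, not chunkfs code; a
real fsck would read these extents from on-disk indirect blocks, which
is exactly where the contested io cost comes from.

/*
 * Minimal user-space sketch of the cross-chunk pass -- NOT chunkfs
 * code.  Data structures and names are invented for illustration.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct extent {                 /* one run of logical blocks */
    unsigned long loff;         /* first logical block */
    unsigned long len;          /* number of blocks mapped */
};

struct cinode {                 /* head (I0) or continuation inode */
    struct extent *ext;
    int next;
};

static int cmp_extent(const void *a, const void *b)
{
    const struct extent *x = a, *y = b;

    return (x->loff > y->loff) - (x->loff < y->loff);
}

/* 0 if the partial mappings of I0..In are disjoint, -1 on overlap */
static int check_continuations(const struct cinode *ino, int n)
{
    struct extent *all;
    int i, j = 0, total = 0;

    for (i = 0; i < n; i++)
        total += ino[i].next;

    /* gather every extent of every continuation inode ... */
    all = malloc(total * sizeof(*all));
    if (!all)
        return -1;
    for (i = 0; i < n; i++) {
        memcpy(all + j, ino[i].ext, ino[i].next * sizeof(*all));
        j += ino[i].next;
    }

    /* ... then sort by logical offset and compare neighbours */
    qsort(all, total, sizeof(*all), cmp_extent);
    for (i = 1; i < total; i++) {
        if (all[i - 1].loff + all[i - 1].len > all[i].loff) {
            fprintf(stderr, "offset %lu mapped twice\n", all[i].loff);
            free(all);
            return -1;
        }
    }
    free(all);
    return 0;
}

int main(void)
{
    /* I0 maps blocks 0-9, I1 maps 10-19, I2 wrongly maps 15-16 */
    struct extent e0[] = { { 0, 10 } };
    struct extent e1[] = { { 10, 10 } };
    struct extent e2[] = { { 15, 2 } };
    struct cinode file[] = { { e0, 1 }, { e1, 1 }, { e2, 1 } };

    return check_continuations(file, 3) ? 1 : 0;
}

The contested cost is visible in the gather loop: every extent of every
continuation inode must be read before any overlap can be ruled out, so
the pass does io proportional to the total continuation meta-data,
however rare continuation inodes are.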
Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
On Tue, 24 Apr 2007, Nikita Danilov wrote:

> Amit Gud writes:
> > Hello,
> >
> > This is an initial implementation of ChunkFS technique, briefly
> > discussed at: http://lwn.net/Articles/190222 and
> > http://cis.ksu.edu/~gud/docs/chunkfs-hotdep-val-arjan-gud-zach.pdf
>
> I have a couple of questions about chunkfs repair process.
>
> First, as I understand it, each continuation inode is a sparse file,
> mapping some subset of logical file blocks into block numbers. Then
> it seems that during the "final phase" fsck has to check that these
> partial mappings are consistent, for example, that no two different
> continuation inodes for a given file contain a block number for the
> same offset. This check requires a scan of all chunks (rather than of
> only those "active during the crash"), which seems to return us to
> the scalability problem chunkfs tries to address.

not quite.

this checking is an O(n^2) or worse problem, and it can eat a lot of
memory in the process. with chunkfs you divide the problem by a large
constant (100 or more) for the checks of individual chunks. after
those are done, the final pass checking the cross-chunk links doesn't
have to keep track of everything, it only needs to check those links
and what they point to.

any ability to mark a filesystem as 'clean' and then not have to check
it on reboot is a bonus on top of this.

David Lang

> Second, it is not clear how, under the assumption of bugs in the file
> system code (which the paper makes at the very beginning), fsck can
> limit itself only to the chunks that were active at the moment of
> crash.
>
> [...]
>
> > Best,
> > AG
>
> Nikita.
Re: AppArmor FAQ
On Thu, 19 Apr 2007, Stephen Smalley wrote:

> [...] already happened to integrate such support into userland.
>
> To look at it in a slightly different way, the AA emphasis on not
> modifying applications could be viewed as a limitation. Ultimately,
> users have security goals that go beyond just what the OS can
> directly enforce, and at least some applications (notably things like
> X, D-BUS, PostgreSQL, etc.) need to likewise support strong domain
> separation and controlled information flow through their own internal
> objects and operations. SELinux provides APIs and infrastructure for
> such applications, and has already done quite a bit of work in that
> space (D-BUS support, XACE/XSELinux, SE-PostgreSQL), whereas AA seems
> to have no interest in going there (and would have to recant its
> emphasis on no application mods to do so). If you actually want to
> truly confine a desktop application, you can't limit yourself to
>                    ^^^
> the kernel. And the label model provides a unifying abstraction for
> dealing with all of these various objects, whereas the path/"natural
> abstraction" model has no unifying abstraction at all.

AA isn't aimed at confining desktop applications. it's aimed at
confining server applications, which really is an easier task. (if it
happens to be useful for some desktop apps as well, so much the better)

David Lang
Re: AppArmor FAQ
On Wed, 18 Apr 2007, James Morris wrote:

> On Tue, 17 Apr 2007, Alan Cox wrote:
>
> > I'm not sure if AppArmor can be made good security for the general
> > case, but it is a model that works in the limited http environment
> > (eg .htaccess) and is something people can play with and hack on
> > and may be possible to configure to be very secure.
>
> Perhaps -- until your httpd is compromised via a buffer overflow or
> simply misbehaves due to a software or configuration flaw, then the
> assumptions being made about its use of pathnames and their security
> properties are out the window.

since AA defines a whitelist of files that httpd is allowed to access,
a compromised httpd may be able to mess up its own files, but it's
still not going to be able to touch other files on the system.

> Without security labeling of the objects being accessed, you can't
> protect against software flaws, which has been a pretty fundamental
> and widely understood requirement in general computing for at least a
> decade.

this is not true. you don't need to label all objects and chunks of
memory, you just need a way to list (and enforce) the objects and
memory that the program is allowed to use. labeling them is one way of
doing this, but not the only way.

David Lang
Re: AppArmor FAQ
remember that the windows NT permission model is theoretically superior
to the Unix permission model, yet there are far more insecure windows
boxes out there than Unix boxes (even taken as a percentage of the
installed base).

I don't think that anyone is claiming that AA is superior to SELinux.
what people are claiming is that the model AA proposes can improve
security in some cases. to use the example from this thread,
/etc/resolv.conf: if you have a webserver that wants to do a name
lookup, you can do one of two things

1. in AA, configure the webserver to have ro access to /etc/resolv.conf

2. in SELinux, tag /etc/resolv.conf, figure out every program on the
   system that accesses it, and then tag those programs with the right
   permissions.

SELinux is designed to make the box safe even against root; AA is
designed to let the admin harden exposed apps without having to think
about everything else on the system. let people use each tool for the
appropriate task.

David Lang

On Wed, 18 Apr 2007, Rob Meijer wrote:

> On Tue, April 17, 2007 23:55, Karl MacMillan wrote:
> > On Mon, 2007-04-16 at 20:20 -0400, James Morris wrote:
> > > On Mon, 16 Apr 2007, John Johansen wrote:
> > >
> > > > Label-based security (exemplified by SELinux, and its
> > > > predecessors in MLS systems) attaches security policy to the
> > > > data. As the data flows through the system, the label sticks to
> > > > the data, and so security policy with respect to this data
> > > > stays intact. This is a good approach for ensuring secrecy, the
> > > > kind of problem that intelligence agencies have.
> > >
> > > Labels are also a good approach for ensuring integrity, which is
> > > one of the most fundamental aspects of the security model
> > > implemented by SELinux. Some may infer otherwise from your
> > > document.
> >
> > Not only that, the implication that secrecy is only useful to
> > intelligence agencies is pretty funny. Personally, I think that
> > protecting the confidentiality of my data is important (and my bank
> > and health care providers protecting the data they have about me).
> > Type Enforcement was specifically designed to be able to address
> > integrity _and_ confidentiality in a way acceptable to commercial
> > organizations.
> >
> > Karl
>
> Shouldn't that be _OR_? As I have always understood it,
> confidentiality models like BLP are by their very nature incompatible
> with integrity models like Biba. Given this incompatibility, I think
> the idea that BLP-style use of labels (the ss-property and the like)
> is only useful to intelligence agencies may well be correct, while
> usage for integrity models like Biba and CW would be much broader
> than that.
>
> A path-based 'least privilege' (POLP) solution would, I think, on its
> own address neither integrity nor confidentiality, and would on top
> of that prove to be yet another 'fat profile' administration hell.
> Having said that, I feel a path-based solution could have great
> potential if it could be used in conjunction with the object
> capability model, which I would consider a simple and practical
> alternative integrity model that does not require labels in an MLS
> manner, and that extends the POLP concept in a way that would be far
> more practical.
> That is, using 'thin' path-based profiles would become very practical
> if all further authority could be communicated using handles, in the
> same way that an open file handle can be communicated.
>
> Rob
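For reference, option 1 in David's message above amounts to a one-line
rule in AppArmor's profile language. A sketch (the profile path and the
other rules are illustrative, not taken from the thread):

    # hypothetical profile for a webserver that needs name lookups
    /usr/sbin/httpd {
      /etc/resolv.conf    r,   # the single rule under discussion
      /var/www/**         r,   # whatever else the server legitimately needs
      /var/log/httpd/*    w,
    }

Anything not listed is denied, which is the whitelist behaviour
described earlier in the thread; no label is ever attached to
/etc/resolv.conf itself.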
Re: compressing intermediate files with LZO on the fly
On Sat, 7 Apr 2007, Willy Tarreau wrote:

> Hi Al,
>
> On Sat, Apr 07, 2007 at 02:32:34PM +0300, Al Boldi wrote:
> > Willy Tarreau wrote:
> > > for some usages (temporary space), light compression can increase
> > > speed. For instance, when processing logs, I get better speed by
> > > compressing intermediate files with LZO on the fly.
> >
> > How can you do that on ext3? Also, can you do that on a partition
> > block-io level?
>
> No, sorry for the confusion. My scripts simply do:
>
>   $ lzop -cd file1.lzo | process | lzop -c3 > file2.lzo
>
> With a decent CPU, you can reach higher read/write data rates than
> what a single off-the-shelf disk can achieve. For this reason, I
> think that reiser4 would be worth trying for this particular usage.
> And in this case, I'm not interested at all in reliability. It's just
> temporary storage. If the disk fails, I throw it away and buy a new
> one.

I see the same thing with my nightly scripts that do syslog analysis.
last year I trimmed 2 hours from the nightly run by processing
compressed files instead of uncompressed ones. (after I did this I
configured it to compress the files as they are rolled; rolling every
5 min, each compression takes <20 seconds, so the compression time is
<30 min)

now I just need to find a version of split that can compress its
output files.

David Lang
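For what it's worth, GNU coreutils later grew exactly this: split
--filter (coreutils 8.13 and newer) pipes each output chunk through a
command instead of writing it directly, with $FILE set to each output
name. Something like (file names and sizes illustrative):

    $ split -b 100m --filter='lzop -c3 > $FILE.lzo' access.log part_

would produce part_aa.lzo, part_ab.lzo, and so on.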
Re: [ANNOUNCE] RAIF: Redundant Array of Independent Filesystems
On Wed, 13 Dec 2006, Nikolai Joukov wrote:

> We have designed a new stackable file system that we called RAIF:
> Redundant Array of Independent Filesystems. Similar to Unionfs, RAIF
> is a fan-out file system and can be mounted over many different
> disk-based, memory, network, and distributed file systems. RAIF can
> use the stable and maintained code of the other file systems and thus
> stay simple itself. Similar to standard RAID, RAIF can replicate the
> data or store it with parity on any subset of the lower file systems.
> RAIF has three main advantages over traditional driver-level RAID
> systems:
>
> [...]

this sounds very interesting. did you see the paper on chunkfs?
http://www.usenix.org/events/hotdep06/tech/prelim_papers/henson/henson_html/

it sounds as if you may be able to build a functional equivalent of
chunkfs with your raid0 mode.

David Lang
Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device arguments from lookup)
what makes you think it's safe to say there's only one floppy drive?

David Lang

On Mon, 21 May 2001, Oliver Xymoron wrote:

> On Sat, 19 May 2001, Alexander Viro wrote:
>
> > Let's distinguish between per-fd effects (that's what name in
> > open(name, flags) is for - you are asking for descriptor and
> > telling what behaviour do you want for IO on it) and system-wide
> > side effects.
> >
> > IMO encoding the former into name is perfectly fine, and no write
> > on another file can be sanely used for that purpose. For the
> > latter, though, we need to write commands into files and here your
> > miscdevices (or procfs files, or /dev/foo/ctl - whatever) is
> > needed.
>
> I'm a little skeptical about the necessity of these per-fd effects in
> the first place - after all, Plan 9 does without them. There's only
> one floppy drive, yes? No concurrent users of serial ports? The
> counter that comes to mind is sound devices supporting multiple
> opens, but I think esound and friends are a better solution to that
> problem.
>
> What I'd like to see:
>
> - An interface for registering an array of related devices (almost
>   always two: raw and ctl) and their legacy device numbers with a
>   single userspace callout that does whatever /dev/ creation needs
>   to be done. Thus, naming and permissions live in user space. No
>   "device node is also a directory" weirdness which is overkill in
>   the vast majority of cases. No kernel names or permissions leaking
>   into userspace.
>
> - An unregister_devices that does the same, giving userspace a
>   chance to persist permissions, etc.
>
> - A userspace program that keeps a mapping of kernel names to /dev/
>   names, permissions, etc.
>
> - An autofs hook that does the reverse mapping for running with
>   modules (possibly calling modprobe directly)
>
> Possible future extension:
>
> - Allow exporting proc as a large collection of devices. Manage
>   /proc in userspace on a tmpfs.
>
> --
> "Love the dolphins," she advised him. "Write by W.A.S.T.E.."
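Oliver's first two bullets describe an interface rather than code. As a
thought experiment, here is a user-space model of what that
registration call might look like; every name and type below is
invented for illustration (no such kernel API existed), and the printf
stands in for the proposed userspace callout:

/*
 * Hypothetical model of the interface proposed above -- NOT a real
 * kernel API.  All names and types are invented for illustration.
 */
#include <stdio.h>

struct dev_entry {
    const char  *kname;         /* stable kernel name, e.g. "floppy0/raw" */
    unsigned int legacy_major;  /* legacy device number, 0 if none */
    unsigned int legacy_minor;
};

/* one callout per (un)registration: userspace owns naming and perms */
static void userspace_callout(const char *action,
                              const struct dev_entry *devs, int ndevs)
{
    int i;

    for (i = 0; i < ndevs; i++)
        printf("%s %s (legacy %u:%u)\n", action, devs[i].kname,
               devs[i].legacy_major, devs[i].legacy_minor);
}

static int register_devices(const struct dev_entry *devs, int ndevs)
{
    userspace_callout("add", devs, ndevs);
    return 0;
}

static int unregister_devices(const struct dev_entry *devs, int ndevs)
{
    /* gives userspace a chance to persist permissions, etc. */
    userspace_callout("remove", devs, ndevs);
    return 0;
}

int main(void)
{
    /* almost always two related nodes: raw and ctl */
    struct dev_entry floppy[] = {
        { "floppy0/raw", 2, 0 },
        { "floppy0/ctl", 0, 0 },
    };

    register_devices(floppy, 2);
    unregister_devices(floppy, 2);
    return 0;
}

The point of the proposal is visible in the callout: the kernel exports
only stable kernel names and legacy numbers, while naming, permissions,
and /dev layout stay entirely in user space (compare the hotplug/udev
design that Linux later adopted).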