Re: Multiple devfs mounts
Richard Gooch [EMAIL PROTECTED] writes: Hi, Al. You've previously stated that you consider the multiple mount feature of devfs broken. I agree that there are some races in there. However, I'm not clear on whether you're saying that the entire concept is broken, or that it can be fixed with appropriate locking. I've asked this before, but haven't had a response. Last I saw, it was his complaint that you varied what you showed at different mount points, and that doing that all in one dcache tree was fundamentally broken. If you feel that it's fundamentally impossible to mount a FS multiple times, please explain your reasoning. At this point it would make sense to just use the generic multiple mount features in the VFS that Alexander has been putting in. Eric
Re: [Fwd: [linux-audio-dev] info point on linux hdr]
Andrew Clausen [EMAIL PROTECTED] writes: Hi all, Any comments? From: Paul Barton-Davis [EMAIL PROTECTED] Subject: [linux-audio-dev] info point on linux hdr To: [EMAIL PROTECTED] Date: Fri Apr 14 07:10:10 2000 -0500 i mentioned in some remarks to benno how important i thought it was to preallocate the files used for hard disk recording under linux. i was doing more work on ardour yesterday, and had the occasion to create some new "tapes", of lengths from 2 to 40 minutes. the simple act of creating 24 5-minute WAV files on ext2, files in which every block has been allocated, takes at least 1 minute, perhaps as much as 2. i create the files by first opening them, then writing a zero byte every blocksize bytes (blocksize comes from stat(2)), except for the WAV header at the beginning. I'm confused. You wrote what is conceptually 24x5=120 minutes of audio data in 2 minutes. So you have 60 times the performance you need. What is the problem? i hadn't done this in a while, but it reminded me of the near-impossibility of expecting the FS code to allocate this much disk space in real time under ext2. If someone knows of a faster way to allocate the disk space, please let me know. It's not that I have a problem right now, just wanted to point out this behaviour. If your blocksize was 1k, increasing it to 4k could help. Eric
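For concreteness, here is a minimal sketch of the preallocation technique Paul describes: open the file, then write one zero byte every blocksize bytes, with blocksize taken from stat(2). The file name, length calculation, and error handling are illustrative rather than taken from the original post, and the real code also writes a WAV header at the start of the file.

    /* Minimal sketch (not Paul's actual code) of the preallocation loop:
     * force every data block to be allocated by touching one byte per
     * filesystem block.  Names and sizes here are illustrative only. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <sys/types.h>
    #include <sys/stat.h>

    static int preallocate(const char *path, off_t length)
    {
        struct stat st;
        off_t off;
        int fd = open(path, O_WRONLY | O_CREAT, 0644);

        if (fd < 0)
            return -1;
        if (fstat(fd, &st) < 0) {
            close(fd);
            return -1;
        }

        /* st_blksize is the preferred I/O size reported by stat(2); with 1k
         * ext2 blocks this loop runs four times as often as with 4k blocks. */
        for (off = 0; off < length; off += st.st_blksize) {
            /* write a single zero byte at the start of each block */
            if (lseek(fd, off, SEEK_SET) < 0 || write(fd, "", 1) != 1) {
                close(fd);
                return -1;
            }
        }
        return close(fd);
    }

    int main(void)
    {
        /* roughly 5 minutes of 16-bit stereo 44.1kHz audio */
        off_t bytes = (off_t)5 * 60 * 44100 * 2 * 2;
        return preallocate("tape.wav", bytes) ? EXIT_FAILURE : EXIT_SUCCESS;
    }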
Re: Ext2 block size
Brian Pomerantz [EMAIL PROTECTED] writes: [snip] Unfortunately, the smallest stripe size I could get on the Mylex card is 8KB. When I tried that size, I was getting half the write throughput compared to a 64KB stripe size. Also, to really have optimal performance I need to use at least an 8KB block size (to match the Alpha page size). Why does matching the Alpha page size give you better performance? After looking at many of the current hard drives, I noticed that the physical block size is variable depending on where you are on the platter. Almost all of the drives I looked up had a maximum transfer size of 64KB for a single transaction. I'm wondering if making the stripe size equal this transfer size * data drives would actually be the optimal configuration. More likely, setting filesystem read-ahead to transfer-size * data drives would give you the performance you are looking for. I think the consensus here at LLNL is that for the scale that we are working with, a larger block size is better. You are dealing with very large files? We have also found that if you can match the stripe size to the block size in a RAID 5, your write performance will be at least twice as good compared to not matching these up. On the ASCI Blue system, we saw a factor of 10 performance increase on the RAID 5 systems when we matched up the block size to the stripe size. The reason behind this is you don't have to worry about the RAID controller doing a read-modify-write of a stripe because your block is the same size as the stripe. I seem to be seeing a strong bias towards bigger is better in the way you are looking at things. And certainly it works that way when you are looking at benchmarks and the performance numbers. However, bigger is better generally breaks down when you look at the sizes of the files you need to store. Most files tend to be small, which makes large block sizes very inefficient in terms of space. The distribution from my PC can be seen below, where it is clear that with an 8K block size the 76% of files smaller than 4K would each leave their single data block less than half full. Also, matching stripe size to block size sounds reasonable, but don't forget that for a small file on a large block the read/modify/write cycle must still happen, either internal to the controller or externally in the operating system caches. So generally keeping the block size small to avoid read/modify/write is a win. To optimize space/performance, ext2 typically does reads/writes of many small contiguous blocks all at one time, which on a typical disk yields performance as good as large block sizes without the extra space penalty. Eric

    file size (bytes)          pct     files
    0 - 511                  15.76%    47773
    512 - 1023               16.70%    50622
    1024 - 2047              24.45%    74109
    2048 - 4095              19.26%    58379
    4096 - 8191              10.35%    31364
    8192 - 16383              6.66%    20179
    16384 - 32767             3.74%    11347
    32768 - 65535             1.73%     5255
    65536 - 131071            0.73%     2212
    131072 - 262143           0.33%      986
    262144 - 524287           0.14%      437
    524288 - 1048575          0.07%      217
    1048576 - 2097151         0.03%       88
    2097152 - 4194303         0.02%       46
    4194304 - 8388607         0.01%       22
    8388608 - 16777215        0.00%        8
    16777216 - 33554431       0.00%        2
    33554432 - 67108863       0.00%        1
    ---------------------------------------
    total                              303047
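To make the space argument concrete, here is a minimal sketch (mine, not from the thread) that computes the internal fragmentation a file suffers in its final block under different block sizes; the sample file sizes are made up to fall in the small buckets of the histogram above.

    /* Sketch of the internal fragmentation argument: bytes wasted in a
     * file's last (partially filled) block for a given block size.
     * The sample file sizes below are illustrative, not measured data. */
    #include <stdio.h>

    static unsigned long wasted(unsigned long file_size, unsigned long block_size)
    {
        if (file_size == 0)
            return 0;                       /* no data blocks allocated */
        unsigned long tail = file_size % block_size;
        return tail ? block_size - tail : 0;
    }

    int main(void)
    {
        unsigned long sizes[] = { 300, 700, 1500, 3000, 6000, 12000 };
        unsigned long block_sizes[] = { 1024, 4096, 8192 };

        for (int b = 0; b < 3; b++) {
            unsigned long total = 0;
            for (int f = 0; f < 6; f++)
                total += wasted(sizes[f], block_sizes[b]);
            printf("%5lu byte blocks: %6lu bytes lost to partial blocks\n",
                   block_sizes[b], total);
        }
        return 0;
    }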
Re: COW snapshotting
Rik van Riel [EMAIL PROTECTED] writes: I know this could involve an extra copy of the page for writing, but it fits in really well with the transactioning scheme and could, with the proper API, give us a way to implement a high-performance transactioning interface for userspace. The user space side of this idea seems completely silly. Plus, user space has more issues to contend with. For filesystem metadata, journalling is used exclusively to recover a consistent filesystem after an unexpected software disappearance (power outage, kernel crash, etc.). In user space, transactions (i.e. what a user space program would use in a database) are also used to sort out multiple processes that perform simultaneous actions without locking. Consider the following points: (a) Transactions usually succeed. (b) Transactions can be huge and involve multiple files. (c) Commits happen at the end of every transaction, and their length affects the latency of operations. In a large transaction, especially one that creates new data (so the rollback logs can be very small), you want the data to go to disk like a normal file write. Then when the commit eventually happens you hardly need to wait at all, and only the rollback log needs to be cleaned up. Consider loading a gigabyte of data into your database in one transaction. Equally, there is the question of what happens if you have two records on the same page being updated in different transactions. Your scheme would either force one transaction to wait for the other, or force the transactions to merge. Not nice. I won't argue that building a nice user space API isn't a good idea. But we need to get fs-level journalling into the kernel first, and we need to think about things more carefully. This isn't like implementing vfork, where the question was what is the best way to stop the parent process, because clone could already cleanly handle the tricky address space sharing. Eric
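To put rough numbers on the gigabyte example (the figures here are illustrative assumptions, not from the thread): at a sustained disk write rate on the order of 20 MB/s, streaming 1 GB of new data takes roughly 50 seconds. If that data goes to disk like a normal file write while the transaction runs, the eventual commit only has to flush a small rollback log and finishes almost immediately; if the data is instead held back behind copy-on-write pages until commit, those 50 seconds of writing land inside the commit itself, and everything waiting on that commit sees the full latency.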
Re: Location of shmfs?
Jeff Garzik [EMAIL PROTECTED] writes: Does anyone know where I can find code for shmfs? Try: http://www.users.uswest.net/~ebiederm/files/ Sorry. I'm in the middle of switching ISP. . . Eric
Re: Location of shmfs?
[EMAIL PROTECTED] (Eric W. Biederman) writes: Jeff Garzik [EMAIL PROTECTED] writes: Does anyone know where I can find code for shmfs? Try: http://www.users.uswest.net/~ebiederm/files/ Make that: http://www.users.uswest.net/~ebiederman/files/ They had to put in my full last name grr. Eric
Re: (reiserfs) Re: RE: journal requirements for buffer.c (was: Roma progress report)
Hans Reiser [EMAIL PROTECTED] writes: I feel we should encourage Linus to allow the following: * unions in struct buffer_head and struct page containing filesystem specific fields comparable to the union in struct inode. No. In struct buffer_head I don't have problems. In struct page (you don't want to pay the overhead for every page of memory)... There is the buffer_head *bh pointer that could be made more generic. * a filesystem operation which flushes to disk a buffer or page, and leaves it to the filesystem whether to flush a whole commit along with it, or a whole bunch of semantically adjacent buffers, or to repack the buffers before writing them, or to write a whole bunch of semantically nearby buffers, or to fill the rest of the RAID stripe, or assign a block number to the page, or mark the page copy_on_write, or whatever else it wants. Or allocate bounce buffers for high memory pages... Though I don't think copy_on_write and (whatever else it wants) are necessarily hot ideas. There need to be some constraints. Do the rest of you agree? I agree that it is a good idea. Exactly how it should be implemented, and whether it should be implemented for 2.3 at all, is a second question. There is also the related issue of the page lock not currently working for NFS (it needs some I/O locking). Since I've been pushing for this for a while, I'll see if there is anything that could be considered a ``bug fix'' that is still doable in 2.3. The recent patch to improve fdatasync performance is also somewhat related. Eric
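As a rough, hypothetical illustration of the two proposals (this is not actual kernel code, and every name below is invented for the sketch): the union mirrors the per-filesystem union that struct inode already carries, and the flush hook leaves the "what else gets written" policy entirely to the filesystem.

    /* Hypothetical sketch only - not real kernel code; all names invented. */
    struct buffer_head;                 /* kernel type, forward declaration */
    struct page;                        /* likewise */

    /* Filesystem-private state hanging off a buffer_head, comparable to the
     * union in struct inode.  For struct page the concern raised above is
     * per-page memory overhead; the existing buffer_head pointer in struct
     * page could instead be made more generic. */
    struct bh_fs_private {
        union {
            void *generic;                   /* opaque, like inode's pointer */
            struct journal_commit *commit;   /* hypothetical journalling fs  */
            struct repacker_hint *repack;    /* hypothetical repacking fs    */
        } u;
    };

    /* A flush operation that hands policy to the filesystem: it may write
     * just this buffer, a whole commit, semantically adjacent buffers, the
     * rest of a RAID stripe, or first assign the page a block number. */
    struct fs_flush_operations {
        int (*flush_buffer)(struct buffer_head *bh);
        int (*flush_page)(struct page *page);
    };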
Re: [patch] [possible race in ext2] Re: how to write get_block?
"Stephen C. Tweedie" [EMAIL PROTECTED] writes: Hi, On Sun, 10 Oct 1999 16:57:18 +0200 (CEST), Andrea Arcangeli [EMAIL PROTECTED] said: My point was that even being forced to do a lookup before creating each empty buffer, will be still faster than 2.2.x as in 2.3.x the hash will contain only metadata. Less elements means faster lookups. The _fast_ quick fix is to maintain a per-inode list of dirty buffers and to invalidate that list when we do a delete. This works for directories if we only support truncate back to zero --- it obviously gets things wrong if we allow partial truncates of directories (but why would anyone want to allow that?!) This would have minimal performance implication and would also allow fast fsync() of indirect block metadata for regular files. What about adding to the end of ext2_alloc_block: bh = get_hash_table(inode-i_dev, result, inode-i_sb-s_blocksize); /* something is playing with our fresh block, make them stop. ;-) */ if (bh) { if (buffer_dirty(bh)) { mark_buffer_clean(bh); wait_on_buffer(bh); } bforget(bh); } This is in a relatively slow/uncommon path and also catches races if indirect blocks are allocated to a file, instead of just directory syncs. Ultimately we really want to have indirect blocks, and the directory in the page cache as it should result in more uniform code, and faster partial truncates (as well as faster syncs). Eric