Re: Multiple devfs mounts

2000-05-01 Thread Eric W. Biederman

Richard Gooch [EMAIL PROTECTED] writes:

   Hi, Al. You've previously stated that you consider the multiple
 mount feature of devfs broken. I agree that there are some races in
 there. However, I'm not clear on whether you're saying that the entire
 concept is broken, or that it can be fixed with appropriate locking.
 I've asked this before, but haven't had a response.

Last I saw, his complaint was that you varied what you showed at
different mount points, and that doing that all in one dcache tree
was fundamentally broken.

 
 If you feel that it's fundamentally impossible to mount a FS multiple
 times, please explain your reasoning.

At this point it would make sense to just use the generic multiple
mount features in the VFS that Alexander has been putting in.
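
As a rough illustration only (none of this is from the original mail):
with the generic per-mountpoint support, attaching the same filesystem
instance in two places is just two mount(2) calls against the same
source.  devfs is simply the example from this thread; the "none"
source, the target paths, and whether a given filesystem accepted a
second mount at the time are all illustrative assumptions.

/* Hedged sketch: mount one filesystem instance at two places using the
 * generic VFS support.  Target paths are examples; needs CAP_SYS_ADMIN
 * and a kernel with the per-mountpoint VFS changes. */
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
        if (mount("none", "/dev", "devfs", 0, NULL) != 0)
                perror("mount /dev");

        /* second attachment of the same filesystem at another point */
        if (mount("none", "/jail/dev", "devfs", 0, NULL) != 0)
                perror("mount /jail/dev");

        return 0;
}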

Eric



Re: [Fwd: [linux-audio-dev] info point on linux hdr]

2000-04-14 Thread Eric W. Biederman

Andrew Clausen [EMAIL PROTECTED] writes:

 Hi all,
 
 Any comments?
 
 From: Paul Barton-Davis [EMAIL PROTECTED]
 Subject: [linux-audio-dev] info point on linux hdr
 To: [EMAIL PROTECTED]
 Date: Fri Apr 14 07:10:10 2000 -0500
 
 i mentioned in some remarks to benno how important i thought it was to
 preallocate the files used for hard disk recording under linux.
 
 i was doing more work on ardour yesterday, and had the occasion to
 create some new "tapes", of lengths from 2 to 40 minutes. 
 
 the simple act of creating 24 five-minute WAV files on ext2, files in
 which every block has been allocated, takes at least 1 minute, perhaps
 as much as 2. i create the files by first opening them, then writing a
 zero byte every blocksize bytes (blocksize comes from stat(2)), except
 for the WAV header at the beginning.

I'm confused.  You wrote what is conceptually 24x5=120 minutes of audio
data in 2 minutes.  So you have 60 times the performance you need.
What is the problem?

 i hadn't done this in a while, but it reminded me of the
 near-impossibility of expecting the FS code to allocate this much disk
 space in real time under ext2. If someone knows of a faster way to
 allocate the disk space, please let me know. Its not that I have a
 problem right now, just wanted to point out this behaviour.

If your blocksize was 1k, increasing it to 4k could help.
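
For reference, a minimal sketch of the preallocation loop Paul
describes above (not his code; the file name, length, and error
handling are illustrative):

/* Force the filesystem to allocate every data block of a new file by
 * writing one zero byte per block, using st_blksize from fstat(2).
 * The WAV header handling from the original mail is omitted. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

static int preallocate(const char *path, off_t length)
{
        struct stat st;
        char zero = 0;
        off_t off;

        int fd = open(path, O_WRONLY | O_CREAT, 0644);
        if (fd < 0) {
                perror(path);
                return -1;
        }
        if (fstat(fd, &st) != 0) {
                perror("fstat");
                close(fd);
                return -1;
        }
        /* one write per st_blksize-sized block allocates that block */
        for (off = 0; off < length; off += st.st_blksize) {
                if (pwrite(fd, &zero, 1, off) != 1) {
                        perror("pwrite");
                        close(fd);
                        return -1;
                }
        }
        ftruncate(fd, length);  /* blocks are allocated; set the exact size */
        return close(fd);
}

int main(void)
{
        /* roughly 5 minutes of 16-bit stereo at 48kHz (numbers are examples) */
        return preallocate("take01.wav", (off_t)5 * 60 * 48000 * 4);
}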

Eric



Re: Ext2 block size

2000-03-05 Thread Eric W. Biederman

Brian Pomerantz [EMAIL PROTECTED] writes:

[snip]

 
 Unfortunately, the smallest stripe size I could get on the Mylex card
 is 8KB.  When I tried that size, I was getting half the write
 throughput compared to a 64KB stripe size.  Also, to really have
 optimal performance I need to use at least an 8KB block size (to match
 the Alpha page size).  

Why does matching the Alpha page size give you better performance?

 After looking at many of the current hard
 drives, I noticed that the physical block size is variable depending
 on where you are on the platter.  Almost all of the drives I looked up
 had a maximum transfer size of 64KB for a single transaction.  I'm
 wondering if making the stripe size equal this transfer size * data
 drives would actually be the optimal configuration.

More likely, setting the filesystem read-ahead to
transfer-size * data-drives would give you the performance you are
looking for.
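
For what it's worth, a sketch of that tuning (an editorial
illustration, not from the original mail): BLKRASET sets the
block-device read-ahead in 512-byte sectors; the Mylex-style device
node and the 4-data-drive layout below are assumptions.

/* Set read-ahead on a RAID block device to (max transfer) * (data drives).
 * Device path and drive count are illustrative; needs root. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

int main(void)
{
        int data_drives = 4;                            /* example layout    */
        long ra = (64 * 1024 / 512) * data_drives;      /* 64KB * 4, sectors */
        int fd = open("/dev/rd/c0d0", O_RDONLY);        /* example device    */

        if (fd < 0 || ioctl(fd, BLKRASET, ra) != 0)
                perror("BLKRASET");
        return 0;
}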

 
 I think the consensus here at LLNL is that for the scale that we are
 working with, a larger block size is better.  

You are dealing with very large files?

 We have also found that
 if you can match the stripe size to the block size in a RAID 5, your
 write performance will be at least twice as good compared to not
 matching these up.  On the ASCI Blue system, we saw a factor of 10
 performance increase on the RAID 5 systems when we matched up the
 block size to the stripe size.  The reason behind this is you don't
 have to worry about the RAID controller doing a read-modify-write of a
 stripe because your block is the same size as the stripe.

I seem to be seeing a strong bias toward bigger-is-better in the way
you are looking at things.  And certainly it works that way when you
are looking at benchmarks and the performance numbers.

However, bigger-is-better generally breaks down when you look at the
sizes of the files you need to store.  Most files tend to be small,
so large block sizes become very inefficient in terms of space.

The distribution from my PC can be seen below, where it is clear
that with an 8K block size 76% of the files would fill less than
half of a block.

Also, matching stripe size to block size sounds reasonable, but don't
forget that for a small file on a large block the read/modify/write
cycle must still happen, either internally in the controller or
externally in the operating system caches.  So generally keeping the
block size small to avoid read/modify/write is a win.

To optimize space and performance, ext2 typically does reads and
writes of many small contiguous blocks all at one time, which on a
typical disk yields performance as good as large block sizes without
the extra space penalty.

Eric

      file size (bytes)      percent    files
             0 - 511          15.76%    47773
           512 - 1023         16.70%    50622
          1024 - 2047         24.45%    74109
          2048 - 4095         19.26%    58379
          4096 - 8191         10.35%    31364
          8192 - 16383         6.66%    20179
         16384 - 32767         3.74%    11347
         32768 - 65535         1.73%     5255
         65536 - 131071        0.73%     2212
        131072 - 262143        0.33%      986
        262144 - 524287        0.14%      437
        524288 - 1048575       0.07%      217
       1048576 - 2097151       0.03%       88
       2097152 - 4194303       0.02%       46
       4194304 - 8388607       0.01%       22
       8388608 - 16777215      0.00%        8
      16777216 - 33554431      0.00%        2
      33554432 - 67108863      0.00%        1
      ---------------------------------------
      total                             303047
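
A quick sanity check of the 76% figure against the table above (an
editorial illustration, not part of the original mail; it counts
whole buckets only, so it is an approximation):

/* Sum the files smaller than half of a candidate block size, using the
 * bucket counts from the table above. */
#include <stdio.h>

int main(void)
{
        /* upper bound of each size bucket (bytes) and its file count */
        struct { long limit; long count; } buckets[] = {
                {  512, 47773 }, { 1024, 50622 }, { 2048, 74109 },
                { 4096, 58379 }, { 8192, 31364 },
        };
        long total = 303047;    /* total files from the table */
        long block = 8192;      /* candidate ext2 block size   */
        long small = 0;

        for (int i = 0; i < 5; i++)
                if (buckets[i].limit <= block / 2)
                        small += buckets[i].count;

        /* prints roughly 76.2% for an 8K block size */
        printf("%.1f%% of files fill less than half an %ldK block\n",
               100.0 * small / total, block / 1024);
        return 0;
}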



Re: COW snapshotting

1999-11-10 Thread Eric W. Biederman

Rik van Riel [EMAIL PROTECTED] writes:


 I know this could involve an extra copy of the page for
 writing, but it fits in really well with the transactioning
 scheme and could, with the proper API, give us a way to
 implement a high-performance transactioning interface for
 userspace. 

The user space side of this idea seems completely silly.
Plus user space has more issues to contend with.

For filesystem metadata, journalling is used exclusively to recover
a consistent filesystem after an unexpected software disappearance
(power outage, kernel crash, etc.).

For user space, transactions (i.e. what a user space program
would use in a database) are also used to sort out multiple
processes that perform simultaneous actions without locking.

Consider the following points:
(a) Transactions usually succeed.
(b) Transactions can be huge and involve multiple files.
(c) Commits happen at the end of every transaction, and how long
they take affects the latency of operations.

In a large transaction, especially one that creates
new data (so the rollback logs can be very small),
you want the data to go to disk like a normal file write.

Then when the commit eventually happens you hardly need
to wait at all, and only the rollback log needs to be 
cleaned up.

Consider loading a gigabyte of data into your database
in one transaction.

Equally, there is the question of what happens if you have two
records on the same page being updated in different transactions.
Your scheme would either force one transaction to wait for the
other, or force the transactions to merge.  Not nice.

I won't argue that building a nice user space API isn't
a good idea.  But we need to get fs level journalling into 
the kernel first, and we need to think about things more carefully.

This isn't like implementing vfork, where the question was what is
the best way to stop the parent process, because clone could already
cleanly handle the tricky address space sharing.

Eric



Re: Location of shmfs?

1999-10-30 Thread Eric W. Biederman

Jeff Garzik [EMAIL PROTECTED] writes:

 Does anyone know where I can find code for shmfs?

Try:
http://www.users.uswest.net/~ebiederm/files/

Sorry.
I'm in the middle of switching ISP. . .

Eric



Re: Location of shmfs?

1999-10-30 Thread Eric W. Biederman

[EMAIL PROTECTED] (Eric W. Biederman) writes:

 Jeff Garzik [EMAIL PROTECTED] writes:
 
  Does anyone know where I can find code for shmfs?
 
 Try:
 http://www.users.uswest.net/~ebiederm/files/
Make that:
http://www.users.uswest.net/~ebiederman/files/

They had to put in my full last name grr.

Eric



Re: (reiserfs) Re: RE: journal requirements for buffer.c (was: Roma progress report)

1999-10-11 Thread Eric W. Biederman

Hans Reiser [EMAIL PROTECTED] writes:

 I feel we should encourage Linus to allow the following:
 
 * unions in struct buffer_head and struct page containing filesystem specific
 fields comparable to the union in struct inode.

No.  

In struct buffer_head I don't have a problem with that.

In struct page you don't want to pay the overhead for every page of
memory, but there is the existing buffer_head *bh pointer that could
be made more generic.
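
To make the idiom concrete, here is a self-contained toy (all names
invented; this is not proposed kernel code) showing filesystem-private
fields overlaid in a union the way struct inode already carries its
per-filesystem union:

/* Toy illustration of the union idiom under discussion.  The struct and
 * field names are made up; only the shape matches the struct inode style. */
#include <stdio.h>

struct fakefs_buffer_info  { int commit_id; };
struct otherfs_buffer_info { int stripe_slot; };

struct toy_buffer_head {
        unsigned long b_blocknr;
        union {                          /* filesystem-specific part */
                struct fakefs_buffer_info  fakefs;
                struct otherfs_buffer_info otherfs;
        } b_u;
};

int main(void)
{
        struct toy_buffer_head bh = { .b_blocknr = 42 };

        bh.b_u.fakefs.commit_id = 7;     /* only the owning fs touches b_u */
        printf("block %lu, commit %d\n", bh.b_blocknr, bh.b_u.fakefs.commit_id);
        return 0;
}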

 
 * a filesystem operation which flushes to disk a buffer or page, and leaves it
 to the filesystem whether to flush a whole commit along with it, or a whole
 bunch of semantically adjacent buffers, or to repack the buffers before writing
 them, or to write a whole bunch of semantically nearby buffers, or to fill the
 rest of the RAID stripe, or assign a block number to the page, or mark the page
 copy_on_write, or whatever else it wants.

Or allocate bounce buffers for high-memory pages...

Though I don't think copy_on_write and (whatever else it wants)
are necessarily hot ideas.  There need to be some constraints.

 Do the rest of you agree?

I agree that it is a good idea.  Exactly how it should be
implemented, and whether it should be done for 2.3, is a second
question.

There is also the related issue of the page lock not currently working
for NFS (it needs some I/O locking).

Since I've been pushing for this for a while, I'll see if there is
anything that could be considered a ``bug fix'' and is still doable
in 2.3.

The recent patch to improve fdatasync performance is also somewhat
related.

Eric



Re: [patch] [possible race in ext2] Re: how to write get_block?

1999-10-11 Thread Eric W. Biederman

"Stephen C. Tweedie" [EMAIL PROTECTED] writes:

 Hi,
 
 On Sun, 10 Oct 1999 16:57:18 +0200 (CEST), Andrea Arcangeli
 [EMAIL PROTECTED] said:
 
  My point was that even being forced to do a lookup before creating
  each empty buffer, will be still faster than 2.2.x as in 2.3.x the hash
  will contain only metadata. Less elements means faster lookups.
 
 The _fast_ quick fix is to maintain a per-inode list of dirty buffers
 and to invalidate that list when we do a delete.  This works for
 directories if we only support truncate back to zero --- it obviously
 gets things wrong if we allow partial truncates of directories (but why
 would anyone want to allow that?!)
 
 This would have minimal performance implication and would also allow
 fast fsync() of indirect block metadata for regular files.

What about adding to the end of ext2_alloc_block:

bh = get_hash_table(inode->i_dev, result, inode->i_sb->s_blocksize);
/* something is playing with our fresh block, make them stop.  ;-) */
if (bh) {
        if (buffer_dirty(bh)) {
                mark_buffer_clean(bh);
                wait_on_buffer(bh);
        }
        bforget(bh);
}

This is in a relatively slow/uncommon path, and it also catches
races when indirect blocks are allocated to a file, not just the
directory sync case.

Ultimately we really want to have indirect blocks and directories
in the page cache, as that should result in more uniform code and
faster partial truncates (as well as faster syncs).

Eric