Re: Proposal for "proper" durable fsync() and fdatasync()

2008-02-26 Thread Ric Wheeler

Jeff Garzik wrote:

Jamie Lokier wrote:

By durable, I mean that fsync() should actually commit writes to
physical stable storage,


Yes, it should.



I was surprised that fsync() doesn't do this already.  There was a lot
of effort put into block I/O write barriers during 2.5, so that
journalling filesystems can force correct write ordering, using disk
flush cache commands.

After all that effort, I was very surprised to notice that Linux 2.6.x
doesn't use that capability to ensure fsync() flushes the disk cache
onto stable storage.


It's surprising you are surprised, given that this [lame] fsync behavior 
has remained consistently lame throughout Linux's history.


Maybe I am confused, but isn't this what fsync() does today whenever 
barriers are enabled (the fsync() flushes the drive's write cache)?
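
(Illustration, not part of the original mail - the file name and sizes are arbitrary: a trivial 
userspace timing test makes the behavior easy to observe. On a drive with its volatile write 
cache enabled and no cache flush issued on fsync(), the measured fsync() latency for a small 
write is often far lower than a real media write plus cache flush could possibly be.)

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        const char *path = argc > 1 ? argv[1] : "fsync-test.dat";
        struct timeval t1, t2;
        char buf[4096];
        int fd;

        memset(buf, 0xab, sizeof(buf));
        fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
        if (fd < 0 || write(fd, buf, sizeof(buf)) != sizeof(buf)) {
                perror(path);
                return 1;
        }

        gettimeofday(&t1, NULL);
        if (fsync(fd) < 0)      /* should not return before the data is durable */
                perror("fsync");
        gettimeofday(&t2, NULL);

        printf("fsync took %ld usec\n",
               (t2.tv_sec - t1.tv_sec) * 1000000L + (t2.tv_usec - t1.tv_usec));
        close(fd);
        return 0;
}

Sub-millisecond results on a busy S-ATA disk with its write cache on are a strong hint 
that the data only reached the drive cache, not the platter.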


ric


Re: [RFD] Incremental fsck

2008-01-14 Thread Ric Wheeler

Pavel Machek wrote:

On Sat 2008-01-12 09:51:40, Theodore Tso wrote:

On Wed, Jan 09, 2008 at 02:52:14PM +0300, Al Boldi wrote:

Ok, but let's look at this a bit more opportunistic / optimistic.

Even after a black-out shutdown, the corruption is pretty minimal, using 
ext3fs at least.



After an unclean shutdown, assuming you have decent hardware that
doesn't lie about when blocks hit iron oxide, you shouldn't have any
corruption at all.  If you have crappy hardware, then all bets are off.


What hardware is crappy here? Let's say... an internal hdd in a thinkpad
x60?

What are ext3 expectations of disk (is there doc somewhere)? For
example... if disk does not lie, but powerfail during write damages
the sector -- is ext3 still going to work properly?

If disk does not lie, but powerfail during write may cause random
numbers to be returned on read -- can fsck handle that?

What about a disk that kills 5 sectors around the sector being written during
a powerfail; can ext3 survive that?

Pavel



I think that you have to keep in mind the way disks (and other media) 
fail. You can get media failures after a successful write, or errors that 
pop up as the media ages.


Not to mention the way most people run with the write cache enabled and 
write barriers disabled - a sure recipe for corruption.


Of course, there are always software errors to introduce corruption even 
when we get everything else right ;-)


From what I see, media errors are the number one cause of corruption in 
file systems. It is critical that fsck (and any other tools) continue 
after an IO error, since such errors are fairly common (just assume that the 
sector is lost and do your best as you continue on).


ric



Re: [patch 1/1] Drop CAP_SYS_RAWIO requirement for FIBMAP

2007-11-01 Thread Ric Wheeler


Pavel Machek wrote:

Hi!


Remove the need for having CAP_SYS_RAWIO when doing a FIBMAP call on an open 
file descriptor.

It would be nice to allow users to have permission to see where their data is 
landing on disk, and there really isn't a good reason to keep them from getting 
at this information.


I believe it is to prevent users from intentionally creating extremely
fragmented files...

You can read 60MB in a second, but a fragmented 60MB file could take
10msec * 60MB/4KB = 150 seconds. That's a factor-of-150 slowdown...

...but I agree that SYS_RAWIO may be wrong capability to cover this.

Pavel


I don't see how restricting FIBMAP use helps prevent fragmentation since FIBMAP 
just allows you to see what damage was already done.


You can create nicely fragmented files simply by having multiple threads writing 
concurrently to one or more files in the same directory (depending on the file 
system, allocation policy, etc).


ric



Re: [patch 0/6][RFC] Cleanup FIBMAP

2007-10-31 Thread Ric Wheeler

Zach Brown wrote:

The second use case is to look at the physical layout of blocks on disk
for a specific file, use Mark Lord's write_long patches to inject a disk
error and then read that file to make sure that we are handling disk IO
errors correctly.  A bit obscure, but really quite useful.


Hmm, yeah, that's interesting.


It would be even better if we could poke holes in metadata, etc, but 
this gives us a reasonable test case.





We have also used FIBMAP a few times to try and map an observed IO error
back to a file. Really slow and painful to do, but should work on any
file system when a better method is not supported.


We're getting off of this FIBMAP topic, but this interests me.  Can we
explore this a little?  How did you find out about the error without
having a file to associate with it?  Drive scrubbing, or some such?

- z


Vladimir extended debugreiserfs to do an optimized reverse mapping scan 
from disk sector to file/metadata/etc. Definitely worth having that 
ability for any file system.


We also do drive scrubbing, looking for bad sectors. The list of those 
sectors is fed into the reverse mapping code to enable us to gauge the 
impact of the IO errors, start recovering the user files, etc.


The scrub code we use takes advantage of the read-verify command (to 
avoid data transfer from the drive to the page cache).
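
(Illustration, not our actual scrub code - the device name, LBA range and timeout below are 
arbitrary: a SCSI VERIFY(10) issued through SG_IO asks the drive to check the blocks 
internally and return only status, so no data crosses the bus to the host.)

#include <fcntl.h>
#include <scsi/sg.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

static int verify_blocks(int fd, uint32_t lba, uint16_t nblocks)
{
        unsigned char cdb[10] = { 0x2f };       /* SCSI VERIFY(10) */
        unsigned char sense[32];
        struct sg_io_hdr hdr;

        cdb[2] = lba >> 24;
        cdb[3] = lba >> 16;
        cdb[4] = lba >> 8;
        cdb[5] = lba;
        cdb[7] = nblocks >> 8;
        cdb[8] = nblocks;

        memset(&hdr, 0, sizeof(hdr));
        hdr.interface_id = 'S';
        hdr.cmd_len = sizeof(cdb);
        hdr.cmdp = cdb;
        hdr.dxfer_direction = SG_DXFER_NONE;    /* no data phase at all */
        hdr.sbp = sense;
        hdr.mx_sb_len = sizeof(sense);
        hdr.timeout = 10000;                    /* milliseconds */

        if (ioctl(fd, SG_IO, &hdr) < 0)
                return -1;
        return hdr.status;      /* non-zero: look at the sense data for a media error */
}

int main(int argc, char **argv)
{
        int fd = open(argc > 1 ? argv[1] : "/dev/sdb", O_RDONLY);

        if (fd < 0) {
                perror("open");
                return 1;
        }
        printf("verify status = %d\n", verify_blocks(fd, 0, 8));
        close(fd);
        return 0;
}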


ric


Re: [patch 0/6][RFC] Cleanup FIBMAP

2007-10-31 Thread Ric Wheeler

Zach Brown wrote:

Can you clarify what you mean above with an example?  I don't really
follow.


Sure, take 'tar' as an example.  It'll read files in the order that
their names are returned from directory listing.  This can produce bad
IO patterns because the order in which the file names are returned
doesn't match the order of the file's blocks on disk.  (htree, I'm
looking at you!)

People have noticed that tar-like loads can be sped up greatly just by
sorting the files by their inode number as returned by stat(), never
mind the file blocks themselves.  One example of this is Chris Mason's
'acp'.

  http://oss.oracle.com/~mason/acp/

The logical extension of that is to use FIBMAP to find the order of file
blocks on disk and then do IO on blocks in sorted order.  It'd take
work to write an app that does this reliably, sure.

In this use the application doesn't actually care what the absolute
numbers are.  It cares about their ordering.  File systems would be able
to chose whatever scheme they wanted for the actual values of the
results from a FIBMAP-alike as long as the sorting resulted in the right
IO patterns.

Arguing that this use is significant enough to justify an addition to
the file system API is a stretch.  I'm just sharing the observation.

- z


I use FIBMAP support for a few different things.

The first is exactly the case that you describe above: we can use the 
first block of a file, extracted via FIBMAP, to produce an optimal 
sorting for the read order.  My testing showed that the cost of the 
extra FIBMAP call was not too high compared to the speedup, but it was 
not a huge gain over the speedup already gained by reading in inode-sorted 
order.
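
(Illustration, not the actual test harness: the FIBMAP lookup that provides the sort key, 
with error handling trimmed. Note that FIBMAP still wants CAP_SYS_RAWIO, per the FIBMAP 
permission thread elsewhere in this archive.)

#include <fcntl.h>
#include <linux/fs.h>           /* FIBMAP */
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

static long first_physical_block(const char *path)
{
        int fd = open(path, O_RDONLY);
        int block = 0;          /* logical block 0 in, physical block out */

        if (fd < 0)
                return -1;
        if (ioctl(fd, FIBMAP, &block) < 0)
                block = -1;
        close(fd);
        return block;
}

int main(int argc, char **argv)
{
        int i;

        for (i = 1; i < argc; i++)
                printf("%ld\t%s\n", first_physical_block(argv[i]), argv[i]);
        return 0;
}

Piping the output through sort -n gives a physical-order read list for the tar-like workload.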


The second use case is to look at the physical layout of blocks on disk 
for a specific file, use Mark Lord's write_long patches to inject a disk 
error and then read that file to make sure that we are handling disk IO 
errors correctly.  A bit obscure, but really quite useful.


We have also used FIBMAP a few times to try and map an observed IO error 
back to a file. Really slow and painful to do, but should work on any 
file system when a better method is not supported.



ric


Re: batching support for transactions

2007-10-03 Thread Ric Wheeler

Andreas Dilger wrote:

On Oct 03, 2007  06:42 -0400, Ric Wheeler wrote:
With 2 threads writing to the same directory, we instantly drop down to 
234 files/sec.

Is this with HZ=250?
Yes - I assume that with HZ=1000 the batching would start to work again 
since the penalty for batching would only be 1ms which would add a 0.3ms 
overhead while waiting for some other thread to join.


This is probably the easiest solution, but at the same time using HZ=1000
adds overhead to the server because of extra interrupts, etc.


We will do some testing with this in the next day or so.


It would seem one of the problems is that we shouldn't really be
scheduling for a fixed 1 jiffie timeout, but rather only until the
other threads have a chance to run and join the existing transaction.
This is really very similar to the domain of the IO schedulers - when do 
you hold off an IO and/or try to combine it.


I was thinking the same.

my guess would be that yield() doesn't block the first thread long enough
for the second one to get into the transaction (e.g. on an 2-CPU system
with 2 threads, yield() will likely do nothing).
Andy tried playing with yield() and it did not do well. Note that this 
server is a dual CPU box, so your intuition is most likely correct.


How many threads did you try?


Andy's tested 1, 2, 4, 8, 20 and 40 threads.  Once we review the test
and his patch, we can post the summary data.


It makes sense to track not only the time to commit a single synchronous
transaction, but also the time between sync transactions to decide if
the initial transaction should be held to allow later ones.
Yes, that is what I was trying to suggest with the rate. Even if we are 
relatively slow, if the IO's are being synched at a low rate, we are 
effectively adding a potentially nasty latency for each IO.


That would give us two measurements to track per IO device - average 
commit time and this average IO's/sec rate. That seems very doable.


Agreed.


This would also seem to be code that would be good to share between all
of the file systems for their transaction bundling.


Alternately, it might be possible to check if a new thread is trying to
start a sync handle when the previous one was also synchronous and had
only a single handle in it, then automatically enable the delay in that 
case.
I am not sure that this avoids the problem with the current defaults at 
250HZ where each wait is sufficient to do 3 fully independent 
transactions ;-)


I was trying to think if there was some way to non-busy-wait that is
less than 1 jiffie.


One other technique would be to use async IO, which could push the 
batching of the fsync's up to application space.  For example, send down 
a sequence of "async fsync" requests for a series of files and then poll 
for completion once you have launched them.
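
(Illustration of that calling pattern, not a tested benchmark: POSIX AIO already has 
aio_fsync(), which queues the flush and lets the caller poll with aio_error(). Note that 
glibc implements POSIX AIO with user-level threads, so whether the journal actually sees 
the requests close enough together to batch them is exactly the open question; link 
with -lrt.)

#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define MAX_FILES 64

int main(int argc, char **argv)
{
        struct aiocb cbs[MAX_FILES];
        int i, nfiles = (argc - 1 > MAX_FILES) ? MAX_FILES : argc - 1;

        memset(cbs, 0, sizeof(cbs));

        /* launch one async fsync per file */
        for (i = 0; i < nfiles; i++) {
                cbs[i].aio_fildes = open(argv[i + 1], O_WRONLY);
                if (cbs[i].aio_fildes < 0)
                        continue;
                if (aio_fsync(O_SYNC, &cbs[i]) < 0) {
                        perror(argv[i + 1]);
                        close(cbs[i].aio_fildes);
                        cbs[i].aio_fildes = -1;
                }
        }

        /* now poll for completion of the whole batch */
        for (i = 0; i < nfiles; i++) {
                if (cbs[i].aio_fildes < 0)
                        continue;
                while (aio_error(&cbs[i]) == EINPROGRESS)
                        usleep(1000);
                if (aio_return(&cbs[i]) < 0)
                        fprintf(stderr, "fsync failed: %s\n", argv[i + 1]);
                close(cbs[i].aio_fildes);
        }
        return 0;
}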


ric




Re: batching support for transactions

2007-10-03 Thread Ric Wheeler

Andreas Dilger wrote:

On Oct 02, 2007  08:57 -0400, Ric Wheeler wrote:
One thing that jumps out is that the way we currently batch synchronous 
work loads into transactions does really horrible things to performance 
for storage devices which have really low latency.


For example, on a mid-range CLARiiON box, we can use a single thread to 
write around 750 (10240-byte) files/sec to a single directory in ext3. 
That gives us an average time of around 1.3ms per file.


With 2 threads writing to the same directory, we instantly drop down to 
234 files/sec.


Is this with HZ=250?


Yes - I assume that with HZ=1000 the batching would start to work again 
since the penalty for batching would only be 1ms which would add a 0.3ms 
overhead while waiting for some other thread to join.





The culprit seems to be the assumptions in journal_stop() which throw in 
a call to schedule_timeout_uninterruptible(1):


        pid = current->pid;
        if (handle->h_sync && journal->j_last_sync_writer != pid) {
                journal->j_last_sync_writer = pid;
                do {
                        old_handle_count = transaction->t_handle_count;
                        schedule_timeout_uninterruptible(1);
                } while (old_handle_count != transaction->t_handle_count);
        }


It would seem one of the problems is that we shouldn't really be
scheduling for a fixed 1 jiffie timeout, but rather only until the
other threads have a chance to run and join the existing transaction.


This is really very similar to the domain of the IO schedulers - when do 
you hold off an IO and/or try to combine it.


It is hard to predict the future need of threads that will be wanting to 
do IO, but you can dynamically measure the average time it takes a 
transaction to commit.


Would it work to do the wait only when the timeout is less than, say, 80% of 
the average commit time?  Using the 1000HZ example, a 1ms wait against an 
average commit time of 1.2 or 1.3 ms?




What seems to be needed here is either a static per file system/storage 
device tunable to allow us to change this timeout (maybe with "0" 
defaulting back to the old reiserfs trick of simply doing a yield()?)


Tunables are to be avoided if possible, since they will usually not be
set except by the .1% of people who actually understand them.  Using
yield() seems like the right thing, but Andrew Morton added this code and
my guess would be that yield() doesn't block the first thread long enough
for the second one to get into the transaction (e.g. on an 2-CPU system
with 2 threads, yield() will likely do nothing).


I agree that tunables are a bad thing.  It might be nice to dream about 
having mkfs do some test timings (issue and time synchronous IOs to 
measure the average IOs/sec) and set this in the superblock.


Andy tried playing with yield() and it did not do well. Note that this 
server is a dual CPU box, so your intuition is most likely correct.


The balance is that the batching does work well for "normal" slow disks, 
especially when using the write barriers (giving us an average commit 
time closer to 20ms).


or a more dynamic, per device way to keep track of the average time it 
takes to commit a transaction to disk. Based on that rate, we could 
dynamically adjust our logic to account for lower latency devices.


It makes sense to track not only the time to commit a single synchronous
transaction, but also the time between sync transactions to decide if
the initial transaction should be held to allow later ones.


Yes, that is what I was trying to suggest with the rate. Even if we are 
relatively slow, if the IO's are being synched at a low rate, we are 
effectively adding a potentially nasty latency for each IO.


That would give us two measurements to track per IO device - average 
commit time and this average IO's/sec rate. That seems very doable.



Alternately, it might be possible to check if a new thread is trying to
start a sync handle when the previous one was also synchronous and had
only a single handle in it, then automatically enable the delay in that case.


I am not sure that this avoids the problem with the current defaults at 
250HZ where each wait is sufficient to do 3 fully independent 
transactions ;-)


ric



batching support for transactions

2007-10-02 Thread Ric Wheeler


After several years of helping tune file systems for normal (ATA/S-ATA) 
drives, we have been doing some performance work on ext3 & reiserfs on 
disk arrays.


One thing that jumps out is that the way we currently batch synchronous 
work loads into transactions does really horrible things to performance 
for storage devices which have really low latency.


For example, on a mid-range CLARiiON box, we can use a single thread to 
write around 750 (10240-byte) files/sec to a single directory in ext3. 
That gives us an average time of around 1.3ms per file.


With 2 threads writing to the same directory, we instantly drop down to 
234 files/sec.


The culprit seems to be the assumptions in journal_stop() which throw in 
a call to schedule_timeout_uninterruptible(1):


        /*
         * Implement synchronous transaction batching.  If the handle
         * was synchronous, don't force a commit immediately.  Let's
         * yield and let another thread piggyback onto this transaction.
         * Keep doing that while new threads continue to arrive.
         * It doesn't cost much - we're about to run a commit and sleep
         * on IO anyway.  Speeds up many-threaded, many-dir operations
         * by 30x or more...
         *
         * But don't do this if this process was the most recent one to
         * perform a synchronous write.  We do this to detect the case where a
         * single process is doing a stream of sync writes.  No point in waiting
         * for joiners in that case.
         */
        pid = current->pid;
        if (handle->h_sync && journal->j_last_sync_writer != pid) {
                journal->j_last_sync_writer = pid;
                do {
                        old_handle_count = transaction->t_handle_count;
                        schedule_timeout_uninterruptible(1);
                } while (old_handle_count != transaction->t_handle_count);
        }


reiserfs and ext4 have similar, if not exactly the same, logic.

What seems to be needed here is either a static per file system/storage 
device tunable to allow us to change this timeout (maybe with "0" 
defaulting back to the old reiserfs trick of simply doing a yield()?) or 
a more dynamic, per device way to keep track of the average time it 
takes to commit a transaction to disk. Based on that rate, we could 
dynamically adjust our logic to account for lower latency devices.
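
(Rough illustration of that dynamic approach, not existing jbd code - the names and the 
7/8 smoothing factor below are made up. The idea is simply to keep a smoothed per-device 
commit time and sync arrival gap, and only sleep for joiners when another sync is likely 
to show up before the commit would have finished anyway.)

#include <stdbool.h>
#include <stdint.h>

struct batch_stats {
        uint64_t avg_commit_ns;         /* smoothed commit duration */
        uint64_t avg_interarrival_ns;   /* smoothed gap between sync handles */
};

/* simple exponentially weighted moving average: new = 7/8 old + 1/8 sample;
 * fed from the commit path and from the timestamps of incoming sync handles */
static void update_avg(uint64_t *avg, uint64_t sample_ns)
{
        *avg = (*avg * 7 + sample_ns) / 8;
}

/* Hold the commit for joiners only if another synchronous handle is
 * expected to arrive before the commit itself would have completed. */
static bool should_wait_for_joiners(const struct batch_stats *s)
{
        return s->avg_interarrival_ns < s->avg_commit_ns;
}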


A couple of last thoughts. One, if for some reason you don't have a low 
latency storage array handy and want to test this for yourselves, you 
can test the worst case by using a ram disk.


The test we used was fs_mark with 10240-byte files, writing to one 
shared directory while varying the number of threads from 1 up to 40. In 
the ext3 case, it takes 8 concurrent threads to catch up to the 
single-thread writing case.


We are continuing to play with the code and try out some ideas, but I 
wanted to bounce this off the broader list to see if this makes sense...


ric




Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-07-13 Thread Ric Wheeler



Guy Watkins wrote:

} -Original Message-
} From: [EMAIL PROTECTED] [mailto:linux-raid-
} [EMAIL PROTECTED] On Behalf Of [EMAIL PROTECTED]
} Sent: Thursday, July 12, 2007 1:35 PM
} To: [EMAIL PROTECTED]
} Cc: Tejun Heo; [EMAIL PROTECTED]; Stefan Bader; Phillip Susi; device-mapper
} development; linux-fsdevel@vger.kernel.org; [EMAIL PROTECTED];
} [EMAIL PROTECTED]; Jens Axboe; David Chinner; Andreas Dilger
} Subject: Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for
} devices, filesystems, and dm/md.
} 
} On Wed, 11 Jul 2007 18:44:21 EDT, Ric Wheeler said:

} > [EMAIL PROTECTED] wrote:
} > > On Tue, 10 Jul 2007 14:39:41 EDT, Ric Wheeler said:
} > >
} > >> All of the high end arrays have non-volatile cache (read, on power
} loss, it is a
} > >> promise that it will get all of your data out to permanent storage).
} You don't
} > >> need to ask this kind of array to drain the cache. In fact, it might
} just ignore
} > >> you if you send it that kind of request ;-)
} > >
} > > OK, I'll bite - how does the kernel know whether the other end of that
} > > fiberchannel cable is attached to a DMX-3 or to some no-name product
} that
} > > may not have the same assurances?  Is there a "I'm a high-end array"
} bit
} > > in the sense data that I'm unaware of?
} > >
} >
} > There are ways to query devices (think of hdparm -I in S-ATA/P-ATA
} drives, SCSI
} > has similar queries) to see what kind of device you are talking to. I am
} not
} > sure it is worth the trouble to do any automatic detection/handling of
} this.
} >
} > In this specific case, it is more a case of when you attach a high end
} (or
} > mid-tier) device to a server, you should configure it without barriers
} for its
} > exported LUNs.
} 
} I don't have a problem with the sysadmin *telling* the system "the other

} end of
} that fiber cable has characteristics X, Y and Z".  What worried me was
} that it
} looked like conflating "device reported writeback cache" with "device
} actually
} has enough battery/hamster/whatever backup to flush everything on a power
} loss".
} (My back-of-envelope calculation shows for a worst-case of needing a 1ms
} seek
} for each 4K block, a 1G cache can take up to 4 1/2 minutes to sync.
} That's
} a lot of battery..)

Most hardware RAID devices I know of use the battery to save the cache while
the power is off.  When the power is restored it flushes the cache to disk.
If the power failure lasts longer than the batteries then the cache data is
lost, but the batteries last 24+ hours I believe.


Most mid-range and high end arrays actually use that battery to ensure that data 
is all written out to permanent media when the power is lost. I won't go into 
how that is done, but it clearly would not be safe to assume that 
your power outage is only going to last a certain length of time (and if it 
lasts longer, you would lose data).




A big EMC array we had had enough battery power to power about 400 disks
while the 16 Gig of cache was flushed.  I think EMC told me the batteries
would last about 20 minutes.  I don't recall if the array was usable during
the 20 minutes.  We never tested a power failure.

Guy


I worked on the team that designed that big array.

At one point, we had an array on loan to a partner who tried to put it in a very 
small data center. A few weeks later, they brought in an electrician who needed 
to run more power into the center.  It was pretty funny - he tried to find a 
power button to turn it off and then just walked over and dropped power trying 
to get the Symm to turn off.  When that didn't work, he was really, really 
confused ;-)


ric


Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-07-12 Thread Ric Wheeler



[EMAIL PROTECTED] wrote:

On Wed, 11 Jul 2007 18:44:21 EDT, Ric Wheeler said:

[EMAIL PROTECTED] wrote:

On Tue, 10 Jul 2007 14:39:41 EDT, Ric Wheeler said:

All of the high end arrays have non-volatile cache (read, on power loss, it is a 
promise that it will get all of your data out to permanent storage). You don't 
need to ask this kind of array to drain the cache. In fact, it might just ignore 
you if you send it that kind of request ;-)

OK, I'll bite - how does the kernel know whether the other end of that
fiberchannel cable is attached to a DMX-3 or to some no-name product that
may not have the same assurances?  Is there a "I'm a high-end array" bit
in the sense data that I'm unaware of?

There are ways to query devices (think of hdparm -I in S-ATA/P-ATA drives, SCSI 
has similar queries) to see what kind of device you are talking to. I am not

sure it is worth the trouble to do any automatic detection/handling of this.

In this specific case, it is more a case of when you attach a high end (or 
mid-tier) device to a server, you should configure it without barriers for its

exported LUNs.


I don't have a problem with the sysadmin *telling* the system "the other end of
that fiber cable has characteristics X, Y and Z".  What worried me was that it
looked like conflating "device reported writeback cache" with "device actually
has enough battery/hamster/whatever backup to flush everything on a power loss".
(My back-of-envelope calculation shows for a worst-case of needing a 1ms seek
for each 4K block, a 1G cache can take up to 4 1/2 minutes to sync.  That's
a lot of battery..)


I think that we are on the same page here - just let the sys admin mount without 
barriers for big arrays.
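
(For completeness, a sketch of what that amounts to for ext3 - the device and mount point 
are arbitrary, and in practice it is just `mount -o barrier=0` or the distribution default.)

#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
        /* ext3 with barriers off: appropriate only when the exported LUN sits
         * behind a non-volatile (battery/NV-RAM) write cache, as discussed above */
        if (mount("/dev/sdb1", "/mnt/array", "ext3", 0, "barrier=0") < 0) {
                perror("mount");
                return 1;
        }
        return 0;
}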


1GB of cache, by the way, is really small for some of us ;-)

ric



Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-07-11 Thread Ric Wheeler


[EMAIL PROTECTED] wrote:

On Tue, 10 Jul 2007 14:39:41 EDT, Ric Wheeler said:

All of the high end arrays have non-volatile cache (read, on power loss, it is a 
promise that it will get all of your data out to permanent storage). You don't 
need to ask this kind of array to drain the cache. In fact, it might just ignore 
you if you send it that kind of request ;-)


OK, I'll bite - how does the kernel know whether the other end of that
fiberchannel cable is attached to a DMX-3 or to some no-name product that
may not have the same assurances?  Is there a "I'm a high-end array" bit
in the sense data that I'm unaware of?



There are ways to query devices (think of hdparm -I in S-ATA/P-ATA drives, SCSI 
has similar queries) to see what kind of device you are talking to. I am not 
sure it is worth the trouble to do any automatic detection/handling of this.


In this specific case, it is more a case of when you attach a high end (or 
mid-tier) device to a server, you should configure it without barriers for its 
exported LUNs.


ric


Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-07-10 Thread Ric Wheeler



Tejun Heo wrote:

[ cc'ing Ric Wheeler for storage array thingie.  Hi, whole thread is at
http://thread.gmane.org/gmane.linux.kernel.device-mapper.devel/3344 ]


I am actually on the list, just really, really far behind in the thread ;-)



Hello,

[EMAIL PROTECTED] wrote:

but when you consider the self-contained disk arrays it's an entirely
different story. you can easily have a few gig of cache and a complete
OS pretending to be a single drive as far as you are concerned.

and the price of such devices is plummeting (in large part thanks to
Linux moving into this space), you can now readily buy a 10TB array for
$10k that looks like a single drive.


Don't those thingies usually have NV cache or backed by battery such
that ORDERED_DRAIN is enough?


All of the high end arrays have non-volatile cache (read, on power loss, it is a 
promise that it will get all of your data out to permanent storage). You don't 
need to ask this kind of array to drain the cache. In fact, it might just ignore 
you if you send it that kind of request ;-)


The size of the NV cache can run from a few gigabytes up to hundreds of 
gigabytes, so you really don't want to invoke cache flushes here if you can 
avoid it.


For this class of device, you can get the required in-order completion and data 
integrity semantics as long as we send the IO's to the device in the correct order.




The problem is that the interface between the host and a storage device
(ATA or SCSI) is not built to communicate that kind of information
(grouped flush, relaxed ordering...).  I think battery backed
ORDERED_DRAIN combined with fine-grained host queue flush would be
pretty good.  It doesn't require some fancy new interface which isn't
gonna be used widely anyway and can achieve most of performance gain if
the storage plays it smart.

Thanks.



I am not really sure that you need this ORDERED_DRAIN for big arrays...

ric


Re: Testing framework

2007-04-23 Thread Ric Wheeler

Avishay Traeger wrote:

On Mon, 2007-04-23 at 02:16 +0530, Karuna sagar K wrote:

For some time I had been working on this file system test framework.
Now I have a implementation for the same and below is the explanation.
Any comments are welcome.




You may want to check out the paper "EXPLODE: A Lightweight, General
System for Finding Serious Storage System Errors" from OSDI 2006 (if you
haven't already).  The idea sounds very similar to me, although I
haven't read all the details of your proposal.

Avishay



It would also be interesting to use the disk error injection patches 
that Mark Lord sent out recently to introduce real sector level 
corruption.  When your file systems are large enough and old enough, 
getting bad sectors and IO errors during an fsck stresses things in 
interesting ways ;-)


ric


Linux 2007 File System & IO Workshop notes & talks

2007-04-10 Thread Ric Wheeler


We have some of the material reviewed and posted now from the IO & FS 
workshop.


USENIX has posted the talks at:

http://www.usenix.org/events/lsf07/tech/tech.html

A write-up of the workshop went out at LWN and provoked a healthy discussion:

http://lwn.net/Articles/226351/

At that LWN article, there is a link to the Linux FS wiki with good notes:

http://linuxfs.pbwiki.com/LSF07-Workshop-Notes

Another summary will go out in the next USENIX ;login edition.

ric



Re: impact of 4k sector size on the IO & FS stack

2007-03-12 Thread Ric Wheeler

Alan Cox wrote:
First generation of 1K sector drives will continue to use the same 
512-byte ATA sector size you are familiar with.  A single 512-byte write 
will cause the drive to perform a read-modify-write cycle.  This 
configuration is physical 1K sector, logical 512b sector.


The problem case is "read-modify-screwup"

At that point we've trashed the block we were writing (a well studied
recovery case), and we've blasted some previously sane, totally
unrelated sector of data out of existence. That's why we need to know
ideally if they are doing the write to a different physical block when
they do this, so that we don't lose the old data. My guess is they won't
as it'll be hard.


I think that the firmware would have to do this in the drive's write 
cache and would always write the modified data back to the same physical 
sector (unless a media error forces a sector remap).


If the firmware modifies the 7 other 512-byte sectors that it read in order to do 
the single 512-byte sector write, then we certainly would see what you describe happen.


In general, it would seem to be a bad idea to allocate a different 
physical sector to underpin this kind of read-modify-write, since that 
would kill the contiguous layout of files, etc.


A future configuration will change the logical ATA interface away from 
512-byte sectors to 1K or 4K.  Here, it is impossible to read a quantity 
smaller than 1K or 4K, whatever the sector size is.


That one I'm not worried about - other than "guess how Redmond decide to
make partition tables work" that one is mostly easy (be fun to see how
many controllers simply can't cope with the command formats)



This will be interesting to find out. I will be sharing a panel with 
some BIOS & MS people, so I will update all on what I hear.


ric


Re: impact of 4k sector size on the IO & FS stack

2007-03-11 Thread Ric Wheeler



Jan Engelhardt wrote:

On Mar 11 2007 18:51, Ric Wheeler wrote:
  

During the recent IO/FS workshop, we spoke briefly about the
coming change to a 4k sector size for disks on linux. If I
recall correctly, the general feeling was that the impact was
not significant since we already do most file system IO in 4k
page sizes and should be fine as long as we partition drives
correctly and avoid non-4k aligned partitions.



Sorry about jumping right in, but what about an 'old-style'
partition table that relies on 512 as a unit?

  
I think that the normal case would involve new drives which would need 
to be partitioned in 4k aligned partitions. Shouldn't that work 
regardless of the unit used in the partition table?
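
(Illustration, not from the original mail - the device name is arbitrary: HDIO_GETGEO on a 
partition device reports the starting sector in 512-byte units, so a 4k-aligned partition 
must start on a multiple of 8.)

#include <fcntl.h>
#include <linux/hdreg.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        const char *dev = argc > 1 ? argv[1] : "/dev/sda1";
        struct hd_geometry geo;
        int fd = open(dev, O_RDONLY);

        if (fd < 0 || ioctl(fd, HDIO_GETGEO, &geo) < 0) {
                perror(dev);
                return 1;
        }
        printf("%s starts at sector %lu - %s4k aligned\n",
               dev, geo.start, (geo.start % 8) ? "NOT " : "");
        close(fd);
        return 0;
}

The traditional DOS layout starting at sector 63 fails this check, which is exactly the 
"old-style" partition table concern raised above.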



ric





Re: impact of 4k sector size on the IO & FS stack

2007-03-11 Thread Ric Wheeler


Alan Cox wrote:
Are there other concerns in the IO or FS stack that we should bring up 
with vendors?  I have been asked to summarize the impact of 4k sectors 
on linux  for a disk vendor gathering and want to make sure that I put 
all of our linux specific items into that summary...



We need to make sure the physical sector size is correctly reported by
the disk (eg in the ATA7 identify data) but I think for libata at least
the right bits are already there and we've got a fair amount of scsi disk
experience with other media sizes (eg 2K) already. 256byte/sector media
is still broken btw 8)
  
It would be really interesting to see if we can validate this with 
prototype drives.

I would be interested to know what the disk vendors intend to use as
their strategy when (with ATA) they have a 512 byte write from an older
file system/setup into a 4K block. The case where errors magically appear
in other parts of the fs when such an error occurs is not IMHO too well
considered.

Alan

As Jeff mentioned, I think that they would have to do a 
read-modify-write simulation, which would kill performance for a small, 
random write workload...


ric



Re: impact of 4k sector size on the IO & FS stack

2007-03-11 Thread Ric Wheeler



Jeff Garzik wrote:

Alan Cox wrote:

I would be interested to know what the disk vendors intend to use as
their strategy when (with ATA) they have a 512 byte write from an older
file system/setup into a 4K block. The case where errors magically 
appear


Well, you have logical and physical sector size changes.

First generation of 1K sector drives will continue to use the same 
512-byte ATA sector size you are familiar with.  A single 512-byte 
write will cause the drive to perform a read-modify-write cycle.  This 
configuration is physical 1K sector, logical 512b sector.
It would seem that most writes would avoid this - hopefully the drive 
firmware could use the write cache to coalesce contiguous IO's into 1k 
multiples when getting streams of 512 byte write requests.


A future configuration will change the logical ATA interface away from 
512-byte sectors to 1K or 4K.  Here, it is impossible to read a 
quantity smaller than 1K or 4K, whatever the sector size is.


Jeff
I will try and see if I can get some specific information on when the 
various flavors of this are going to appear...


ric



impact of 4k sector size on the IO & FS stack

2007-03-11 Thread Ric Wheeler


During the recent IO/FS workshop, we spoke briefly about the coming 
change to a 4k sector size for disks on linux. If I recall correctly, 
the general feeling was that the impact was not significant since we 
already do most file system IO in 4k page sizes and should be fine as 
long as we partition drives correctly and avoid non-4k aligned partitions.


Are there other concerns in the IO or FS stack that we should bring up 
with vendors?  I have been asked to summarize the impact of 4k sectors 
on linux  for a disk vendor gathering and want to make sure that I put 
all of our linux specific items into that summary...


ric




Re: end to end error recovery musings

2007-02-27 Thread Ric Wheeler

Martin K. Petersen wrote:

"Eric" == Moore, Eric <[EMAIL PROTECTED]> writes:


Eric> Martin K. Petersen on Data Integrity Feature, which is also
Eric> called EEDP (End to End Data Protection), which he presented some
Eric> ideas/suggestions of adding an API in linux for this.  

T10 DIF is interesting for a few things: 


 - Ensuring that the data integrity is preserved when writing a buffer
   to disk

 - Ensuring that the write ends up on the right hardware sector

These features make the most sense in terms of WRITE.  Disks already
have plenty of CRC on the data so if a READ fails on a regular drive
we already know about it.


There are paths through a read that could still benefit from the extra 
data integrity.  The CRC gets validated on the physical sector, but we 
don't have the same level of strict data checking once the data has been 
read into the drive's cache or is being transferred out of the cache on 
the way to the transport...




We can, however, leverage DIF with my proposal to expose the
protection data to host memory.  This will allow us to verify the data
integrity information before passing it to the filesystem or
application.  We can say "this is really the information the disk
sent. It hasn't been mangled along the way".

And by using the APP tag we can mark a sector as - say - metadata or
data to ease putting the recovery puzzle back together.

It would be great if the app tag was more than 16 bits.  Ted mentioned
that ideally he'd like to store the inode number in the app tag.  But
as it stands there isn't room.

In any case this is all slightly orthogonal to Ric's original post
about finding the right persistence heuristics in the error handling
path...



Still all a very relevant discussion - I agree that we could really use 
more than just 16 bits...


ric



Re: end to end error recovery musings

2007-02-26 Thread Ric Wheeler



Jeff Garzik wrote:

Theodore Tso wrote:

Can someone with knowledge of current disk drive behavior confirm that
for all drives that support bad block sparing, if an attempt to write
to a particular spot on disk results in an error due to bad media at
that spot, the disk drive will automatically rewrite the sector to a
sector in its spare pool, and automatically redirect that sector to
the new location.  I believe this should be always true, so presumably
with all modern disk drives a write error should mean something very
serious has happened.



This is what will /probably/ happen.  The drive should indeed find a 
spare sector and remap it, if the write attempt encounters a bad spot on 
the media.


However, with a large enough write, large enough bad-spot-on-media, and 
a firmware programmed to never take more than X seconds to complete 
their enterprise customers' I/O, it might just fail.



IMO, somewhere in the kernel, when we receive a read-op or write-op 
media error, we should immediately try to plaster that area with small 
writes.  Sure, if it's a read-op you lost data, but this method will 
maximize the chance that you can refresh/reuse the logical sectors in 
question.


Jeff


One interesting counterexample is a smaller write than a full page - say 512 
bytes out of 4k.


If we need to do a read-modify-write and it just so happens that 1 of the 7 
sectors we need to read is flaky, will this "look" like a write failure?


ric


Re: end to end error recovery musings

2007-02-26 Thread Ric Wheeler


Alan wrote:
I think that this is mostly true, but we also need to balance this against the 
need for higher levels to get a timely response.  In a really large IO, a naive 
retry of a very large write could lead to a non-responsive system for a very 
long time...


And losing the I/O could result in a system that is non responsive until
the tape restore completes two days later


Which brings us back to a recent discussion at the file system workshop on being 
more repair oriented in file system design so we can survive situations like 
this a bit more reliably ;-)


ric


Re: end to end error recovery musings

2007-02-26 Thread Ric Wheeler



Alan wrote:

the new location.  I believe this should be always true, so presumably
with all modern disk drives a write error should mean something very
serious has happened.


Not quite that simple.


I think that write errors are normally quite serious, but there are exceptions 
that might be worked around with retries.  To Ted's point, in 
general, a write to a bad spot on the media will cause a remapping which should 
be transparent (if a bit slow) to us.




If you write a block-aligned chunk the same size as the physical media
block size, maybe this is true. If you write a sector on a device with
physical sector size larger than logical block size (as allowed by say
ATA7) then it's less clear what happens. I don't know if the drive
firmware implements multiple "tails" in this case.

On a read error it is worth trying the other parts of the I/O.



I think that this is mostly true, but we also need to balance this against the 
need for higher levels to get a timely response.  In a really large IO, a naive 
retry of a very large write could lead to a non-responsive system for a very 
long time...


ric





end to end error recovery musings

2007-02-23 Thread Ric Wheeler
In the IO/FS workshop, one idea we kicked around is the need to provide 
better and more specific error messages between the IO stack and the 
file system layer.


My group has been working to stabilize a relatively up-to-date libata + 
MD based box, so I can try to lay out at least one "appliance-like" 
typical configuration to help frame the issue. We are working on a 
relatively large appliance, but you can buy similar home appliances (or 
build them) that use Linux to provide a NAS-in-a-box for end users.


The use case that we have is on an ICH6R/AHCI box with 4 large (500+ GB) 
drives, with some of the small system partitions on a 4-way RAID1 
device. The libata version we have is a back port of 2.6.18 onto SLES10, 
so the error handling at the libata level is a huge improvement over 
what we had before.


Each box has a watchdog timer that can be set to fire after at most 2 
minutes.


(We have a second flavor of this box with an ICH5 and P-ATA drives using 
the non-libata drivers that has a similar use case).


Using the patches that Mark sent around recently for error injection, we 
inject media errors into one or more drives and try to see how smoothly 
error handling runs and, importantly, whether or not the error handling 
will complete before the watchdog fires and reboots the box.  If you 
want to be especially mean, inject errors into the RAID superblocks on 3 
out of the 4 drives.


We still have the following challenges:

   (1) read-ahead often means that we will retry every bad sector at 
least twice from the file system level. The first time, the fs read-ahead 
request triggers a speculative read that includes the bad sector 
(triggering the error handling mechanisms) just before the real 
application read does the same thing.  Not sure what the answer is here, 
since read-ahead is obviously a huge win in the normal case (one possible 
mitigation for recovery tools is sketched after this list).


   (2) the patches that were floating around on how to make sure that 
we effectively handle single sector errors in a large IO request are 
critical. On one hand, we want to combine adjacent IO requests into 
larger IO's whenever possible. On the other hand, when the combined IO 
fails, we need to isolate the error to the correct range, avoid 
reissuing a request that touches that sector again and communicate up 
the stack to file system/MD what really failed.  All of this needs to 
complete in tens of seconds, not multiple minutes.


   (3) The timeout values on the failed IO's need to be tuned well (as 
was discussed in an earlier linux-ide thread). We cannot afford to hang 
for 30 seconds, especially in the MD case, since you might need to fail 
more than one device for a single IO.  Prompt error propagation (say 
that four times quickly!) can allow MD to mask the underlying errors as you 
would hope; hanging on too long will almost certainly cause a watchdog 
reboot (see the timeout sketch after this list)...


   (4) The newish libata+SCSI stack is pretty good at handling disk 
errors, but adding in MD actually can reduce the reliability of your 
system unless you tune the error handling correctly.
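
(Two illustrations for points (1) and (3) above, neither taken from our appliance code. 
For (1), a salvage tool that is deliberately walking a damaged file can at least suppress 
the speculative read with posix_fadvise(); the file name and skip size are arbitrary.)

#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        char buf[4096];
        ssize_t n;
        int fd = open(argc > 1 ? argv[1] : "damaged-file", O_RDONLY);

        if (fd < 0) {
                perror("open");
                return 1;
        }
        /* POSIX_FADV_RANDOM turns kernel readahead off for this descriptor,
         * so only the blocks we actually ask for get touched */
        posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);

        while ((n = read(fd, buf, sizeof(buf))) != 0) {
                if (n < 0) {
                        perror("read");
                        /* give up on this block and skip past it */
                        if (lseek(fd, sizeof(buf), SEEK_CUR) < 0)
                                break;
                }
                /* ... salvage whatever was readable ... */
        }
        close(fd);
        return 0;
}

For (3), the per-device SCSI command timeout is already exposed through sysfs 
(/sys/block/<dev>/device/timeout, in seconds) on the libata/SCSI stack; a sketch of 
turning it down so MD hears about failures well before the watchdog fires - the 
10-second value is only an example:

#include <stdio.h>

static int set_scsi_timeout(const char *dev, int seconds)
{
        char path[128];
        FILE *f;

        snprintf(path, sizeof(path), "/sys/block/%s/device/timeout", dev);
        f = fopen(path, "w");
        if (!f)
                return -1;
        fprintf(f, "%d\n", seconds);
        return fclose(f);
}

int main(void)
{
        return set_scsi_timeout("sda", 10) ? 1 : 0;
}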


We will follow up with specific issues as they arise, but I wanted to 
lay out a use case that can help frame part of the discussion.  I also 
want to encourage people to inject real disk errors with the Mark 
patches so we can share the pain ;-)


ric





Re: NFSv4/pNFS possible POSIX I/O API standards

2006-12-01 Thread Ric Wheeler



Andreas Dilger wrote:

On Nov 29, 2006  09:04 +, Christoph Hellwig wrote:

 - readdirplus

This one is completely unneeded as a kernel API.  Doing readdir
plus calls on the wire makes a lot of sense and we already do
that for NFSv3+.  Doing this at the syscall layer just means
kernel bloat - syscalls are very cheap.


The question is how does the filesystem know that the application is
going to do readdir + stat every file?  It has to do this as a heuristic
implemented in the filesystem to determine if the ->getattr() calls match
the ->readdir() order.  If the application knows that it is going to be
doing this (e.g. ls, GNU rm, find, etc) then why not let the filesystem
take advantage of this information?  If combined with the statlite
interface, it can make a huge difference for clustered filesystems.




I think that this kind of heuristic would be a win for local file systems with a 
huge number of files as well...
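
(For reference, not from the original mail: the access pattern such a heuristic has to 
recognize is just the classic readdir-then-stat loop below - ->getattr() calls arriving 
in exactly the order ->readdir() returned the names.)

#include <dirent.h>
#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
        const char *dirpath = argc > 1 ? argv[1] : ".";
        char path[4096];
        struct dirent *de;
        struct stat st;
        DIR *d = opendir(dirpath);

        if (!d) {
                perror(dirpath);
                return 1;
        }
        while ((de = readdir(d)) != NULL) {
                snprintf(path, sizeof(path), "%s/%s", dirpath, de->d_name);
                if (stat(path, &st) == 0)       /* stat in readdir order */
                        printf("%10lu  %s\n",
                               (unsigned long)st.st_size, de->d_name);
        }
        closedir(d);
        return 0;
}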


ric


Re: XFS corruption during power-blackout

2005-07-11 Thread Ric Wheeler



Jens Axboe wrote:


On Fri, Jul 01 2005, Bryan Henderson wrote:
 

Wouldn't a commercial class drive that ignores explicit flushes be 
infamous?  I'm ready to accept that there are SCSI drives that cache 
writes in volatile storage by default (but frankly, I'm still skeptical), 
but I'm not ready to accept that there are drives out there secretly 
ignoring explicit commands to harden data, thus jeopardizing millions of 
dollars' worth of data.  I'd need more evidence.
   



I'm pretty sure I have an IBM drive that does so (its flush cache
command is _really_ fast), as a matter of fact :-) I need to locate it
and put it in a test box to re-ensure this.

I'm not sure such drives would necessarily be infamous, hardly anyone
would notice anything wrong in a desktop type machine. Which is what
these drives were made for.
 

One other thing to keep in mind is that drive firmware can have bugs 
just like any other bit of code, so a drive may have a bug in one 
firmware revision that gets fixed in a following one. 

I am not sure how much that other operating system uses flush cache 
commands, but until the write barrier patch it had been a relatively 
rarely issued command on Linux, so breakage would not be noticed.
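
(Illustration of the kind of check hinted at above, not a rigorous test - the device name 
is arbitrary and this needs root: issue an ATA FLUSH CACHE (0xE7) through HDIO_DRIVE_CMD 
right after a burst of writes and time it. A flush that consistently completes in 
microseconds on a drive with its write cache enabled and dirty is the suspicious 
"really fast" case described above.)

#include <fcntl.h>
#include <linux/hdreg.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/time.h>
#include <unistd.h>

#define ATA_FLUSH_CACHE 0xe7

int main(int argc, char **argv)
{
        const char *dev = argc > 1 ? argv[1] : "/dev/hda";
        unsigned char args[4] = { ATA_FLUSH_CACHE, 0, 0, 0 };
        struct timeval t1, t2;
        int fd = open(dev, O_RDONLY);

        if (fd < 0) {
                perror(dev);
                return 1;
        }
        gettimeofday(&t1, NULL);
        if (ioctl(fd, HDIO_DRIVE_CMD, args) < 0)
                perror("HDIO_DRIVE_CMD");
        gettimeofday(&t2, NULL);
        printf("FLUSH CACHE took %ld usec\n",
               (t2.tv_sec - t1.tv_sec) * 1000000L + (t2.tv_usec - t1.tv_usec));
        close(fd);
        return 0;
}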


