Re: Some very basic questions

2008-10-22 Thread Tejun Heo
Ric Wheeler wrote: >> FS waiting for completion of all the dependent writes isn't too good >> latency and throughput-wise tho. It would be best if FS can indicate >> dependencies between write commands and barrier so that barrier >> doesn't have to empty the whole queue. Hmm... Can someone tell m

Re: Some very basic questions

2008-10-22 Thread Ric Wheeler
Eric Anopolsky wrote: On Thu, 2008-10-23 at 01:14 +0900, Tejun Heo wrote: Ric Wheeler wrote: Waiting for the target to ack an IO is not sufficient, since the target ack does not (with write cache enabled) mean that it is on persistent storage. FS waiting for completion of all th

Re: Some very basic questions

2008-10-22 Thread Eric Anopolsky
On Thu, 2008-10-23 at 01:14 +0900, Tejun Heo wrote: > Ric Wheeler wrote: > > Waiting for the target to ack an IO is not sufficient, since the target > > ack does not (with write cache enabled) mean that it is on persistent > > storage. > > FS waiting for completion of all the dependent writes isn'

Re: Some very basic questions

2008-10-22 Thread Ric Wheeler
Avi Kivity wrote: Tejun Heo wrote: For most SATA drives, disabling write back cache seems to take a high toll on write throughput. :-( I measured this yesterday. This is true for pure write workloads; for mixed read/write workloads the throughput decrease is negligible. Depends on your

Re: BTRFS Performance page

2008-10-22 Thread Steven Pratt
Paul P Komkoff Jr wrote: Replying to Steven Pratt: Steven Pratt wrote: RAID data is now uploaded. The config used is 136 15k rpm fiber disks in 8 arrays all striped together with DM. These results are not as favorable to BTRFS, as there seem to be some major issues with random write a

Re: BTRFS Performance page

2008-10-22 Thread Paul P Komkoff Jr
Replying to Steven Pratt: > Steven Pratt wrote: > RAID data is now uploaded. The config used is 136 15k rpm fiber disks > in 8 arrays all striped together with DM. These results are not as > favorable to BTRFS, as there seem to be some major issues with random > write and mail server worklo

Re: Some very basic questions

2008-10-22 Thread Avi Kivity
jim owens wrote: For most SATA drives, disabling write back cache seems to take a high toll on write throughput. :-( I measured this yesterday. This is true for pure write workloads; for mixed read/write workloads the throughput decrease is negligible. Different tests on different hardware g

Re: Some very basic questions

2008-10-22 Thread jim owens
Avi Kivity wrote: Tejun Heo wrote: For most SATA drives, disabling write back cache seems to take a high toll on write throughput. :-( I measured this yesterday. This is true for pure write workloads; for mixed read/write workloads the throughput decrease is negligible. Different tests on d

Re: Some very basic questions

2008-10-22 Thread Avi Kivity
Tejun Heo wrote: For most SATA drives, disabling write back cache seems to take a high toll on write throughput. :-( I measured this yesterday. This is true for pure write workloads; for mixed read/write workloads the throughput decrease is negligible. As long as the error status is sti

Re: Some very basic questions

2008-10-22 Thread Avi Kivity
Ric Wheeler wrote: For any given set of disks, you "just" need to do the math to compute the utilized capacity, the expected rate of drive failure, the rebuild time and then see whether you can recover from your first failure before a 2nd disk dies. Spare disks have the advantage of a fully

Re: Some very basic questions

2008-10-22 Thread jim owens
Michel Salim wrote: Though it would be nice to have a tool that would provide enough information to make a warranty claim -- does btrfs keep enough information for such a tool to be written? Failed device I/O (rather than bad checksums and other fs-specific error detections) should be logged a

Re: Some very basic questions

2008-10-22 Thread Ric Wheeler
Tejun Heo wrote: Ric Wheeler wrote: Waiting for the target to ack an IO is not sufficient, since the target ack does not (with write cache enabled) mean that it is on persistent storage. FS waiting for completion of all the dependent writes isn't too good latency and throughput-wise th

Re: Some very basic questions

2008-10-22 Thread Tejun Heo
Ric Wheeler wrote: > Waiting for the target to ack an IO is not sufficient, since the target > ack does not (with write cache enabled) mean that it is on persistent > storage. FS waiting for completion of all the dependent writes isn't too good latency and throughput-wise tho. It would be best if

Re: Some very basic questions

2008-10-22 Thread Chris Mason
On Wed, 2008-10-22 at 11:25 -0400, Ric Wheeler wrote: > Avi Kivity wrote: > > Ric Wheeler wrote: > >>> > >>> Well, btrfs is not about duplicating how most storage works today. > >>> Spare capacity has significant advantages over spare disks, such as > >>> being able to mix disk sizes, RAID level

Re: Some very basic questions

2008-10-22 Thread Michel Salim
On Wed, Oct 22, 2008 at 9:52 AM, Stephan von Krawczynski <[EMAIL PROTECTED]> wrote: > On Wed, 22 Oct 2008 09:15:45 -0400 > Chris Mason <[EMAIL PROTECTED]> wrote: > >> On Wed, 2008-10-22 at 14:27 +0200, Stephan von Krawczynski wrote: >> > On Tue, 21 Oct 2008 13:31:37 -0400 >> > Ric Wheeler <[EMAIL P

Re: Some very basic questions

2008-10-22 Thread Ric Wheeler
Avi Kivity wrote: Chris Mason wrote: One problem with the spare capacity model is the general trend where drives from the same batch that get hammered on in the same way tend to die at the same time. Some shops will sleep better knowing there's a hot spare and that's fine by me. How does h

Re: BTRFS Performance page

2008-10-22 Thread Chris Mason
On Wed, 2008-10-22 at 10:45 -0500, Steven Pratt wrote: > Chris Mason wrote: > > On Wed, 2008-10-22 at 10:00 -0500, Steven Pratt wrote: > > > >> Steven Pratt wrote: > >> > >>> As discussed on the BTRFS conference call, myself and Kevin Corry have > >>> set up some test machines for the purp

Re: BTRFS Performance page

2008-10-22 Thread Steven Pratt
Chris Mason wrote: On Wed, 2008-10-22 at 10:00 -0500, Steven Pratt wrote: Steven Pratt wrote: As discussed on the BTRFS conference call, myself and Kevin Corry have set up some test machines for the purpose of doing performance testing on BTRFS. The intent is to have a semi permanent

Re: Some very basic questions

2008-10-22 Thread Avi Kivity
Chris Mason wrote: One problem with the spare capacity model is the general trend where drives from the same batch that get hammered on in the same way tend to die at the same time. Some shops will sleep better knowing there's a hot spare and that's fine by me. How does hot sparing help? A

Re: Some very basic questions

2008-10-22 Thread Avi Kivity
Ric Wheeler wrote: I think that the btrfs plan is still to push more complicated RAID schemes off to MD (RAID6, etc) so this is an issue even with a JBOD. It will be interesting to map out the possible ways to use built in mirroring, etc vs the external RAID and actually measure the utilized c

Re: Some very basic questions

2008-10-22 Thread Ric Wheeler
Avi Kivity wrote: Ric Wheeler wrote: Well, btrfs is not about duplicating how most storage works today. Spare capacity has significant advantages over spare disks, such as being able to mix disk sizes, RAID levels, and better performance. Sure, there are advantages that go in favour of one

Re: BTRFS Performance page

2008-10-22 Thread Chris Mason
On Wed, 2008-10-22 at 10:00 -0500, Steven Pratt wrote: > Steven Pratt wrote: > > As discussed on the BTRFS conference call, myself and Kevin Corry have > > set up some test machines for the purpose of doing performance testing > > on BTRFS. The intent is to have a semi permanent setup that we ca

Re: Some very basic questions

2008-10-22 Thread Avi Kivity
Ric Wheeler wrote: Well, btrfs is not about duplicating how most storage works today. Spare capacity has significant advantages over spare disks, such as being able to mix disk sizes, RAID levels, and better performance. Sure, there are advantages that go in favour of one or the other appr

Re: Some very basic questions

2008-10-22 Thread Ric Wheeler
Avi Kivity wrote: Ric Wheeler wrote: You want to have spare capacity, enough for one or two (or fifteen) drives' worth of data. When a drive goes bad, you rebuild into the spare capacity you have. That is a different model (and one that makes sense, we used that in Centera for object level

Re: BTRFS Performance page

2008-10-22 Thread Steven Pratt
Steven Pratt wrote: As discussed on the BTRFS conference call, myself and Kevin Corry have set up some test machines for the purpose of doing performance testing on BTRFS. The intent is to have a semi permanent setup that we can use to test new features and code drops in BTRFS as well as to do

Re: Some very basic questions

2008-10-22 Thread Avi Kivity
Ric Wheeler wrote: You want to have spare capacity, enough for one or two (or fifteen) drives' worth of data. When a drive goes bad, you rebuild into the spare capacity you have. That is a different model (and one that makes sense, we used that in Centera for object level protection schemes)

Re: Some very basic questions

2008-10-22 Thread Ric Wheeler
Avi Kivity wrote: Ric Wheeler wrote: One key is not to replace the drives too early - you often can recover significant amounts of data from a drive that is on its last legs. This can be useful even in RAID rebuilds since with today's enormous drive capacities, you might hit a latent error dur

Re: Some very basic questions

2008-10-22 Thread jim owens
Ric Wheeler wrote: Matthias Wächter wrote: On 10/22/2008 3:50 PM, Chris Mason wrote: Let me reword my answer ;). The next write will always succeed unless the drive is out of remapping sectors. If the drive is out, it is only good for reads and holding down paper on your desk. I hav

Re: Some very basic questions

2008-10-22 Thread Avi Kivity
Chris Mason wrote: You want to have spare capacity, enough for one or two (or fifteen) drives' worth of data. When a drive goes bad, you rebuild into the spare capacity you have. You want spare capacity that does not degrade your raid levels if you move the data onto it. In some confi

Re: Some very basic questions

2008-10-22 Thread Chris Mason
On Wed, 2008-10-22 at 16:32 +0200, Avi Kivity wrote: > Ric Wheeler wrote: > > One key is not to replace the drives too early - you often can recover > > significant amounts of data from a drive that is on its last legs. > > This can be useful even in RAID rebuilds since with today's enormous >

Re: Some very basic questions

2008-10-22 Thread dbz
Concerning this discussion, I'd like to put up some "requests" which strongly oppose those brought up initially: - if you run into an error in the fs structure or any IO error that prevents you from bringing the fs into a consistent state, please simply oops. If a user feels that availabili

Re: Some very basic questions

2008-10-22 Thread Ric Wheeler
Matthias Wächter wrote: On 10/22/2008 3:50 PM, Chris Mason wrote: Let me reword my answer ;). The next write will always succeed unless the drive is out of remapping sectors. If the drive is out, it is only good for reads and holding down paper on your desk. I have a fairly new SATA

Re: Some very basic questions

2008-10-22 Thread Avi Kivity
Ric Wheeler wrote: One key is not to replace the drives too early - you often can recover significant amounts of data from a drive that is on its last legs. This can be useful even in RAID rebuilds since with today's enormous drive capacities, you might hit a latent error during the rebuild on

Re: Some very basic questions

2008-10-22 Thread Avi Kivity
jim owens wrote: Avi Kivity wrote: jim owens wrote: Remember that the device bandwidth is the limiter so even when each host has a dedicated path to the device (as in dual port SAS or FC), that 2nd host cuts the throughput by more than 1/2 with uncoordinated seeks and transfers. That's only

Re: Some very basic questions

2008-10-22 Thread Ric Wheeler
Chris Mason wrote: On Wed, 2008-10-22 at 09:38 -0400, Ric Wheeler wrote: Chris Mason wrote: On Wed, 2008-10-22 at 22:15 +0900, Tejun Heo wrote: Ric Wheeler wrote: I think that we do handle a failure in the case that you outline above since the FS will be able t

Re: Some very basic questions

2008-10-22 Thread jim owens
Avi Kivity wrote: jim owens wrote: Remember that the device bandwidth is the limiter so even when each host has a dedicated path to the device (as in dual port SAS or FC), that 2nd host cuts the throughput by more than 1/2 with uncoordinated seeks and transfers. That's only a problem if there

Re: BTRFS Performance page

2008-10-22 Thread Chris Mason
On Wed, 2008-10-22 at 08:53 -0500, Steven Pratt wrote: > Chris Mason wrote: > > On Tue, Oct 21, 2008 at 05:20:03PM -0500, Steven Pratt wrote: > > > >> As discussed on the BTRFS conference call, myself and Kevin Corry have > >> set up some test machines for the purpose of doing performance test

Re: Some very basic questions

2008-10-22 Thread Matthias Wächter
On 10/22/2008 3:50 PM, Chris Mason wrote: > Let me reword my answer ;). The next write will always succeed unless > the drive is out of remapping sectors. If the drive is out, it is only > good for reads and holding down paper on your desk. I have a fairly new SATA disk with about 3000 hours of

Re: Some very basic questions

2008-10-22 Thread Stephan von Krawczynski
On Wed, 22 Oct 2008 05:48:30 -0700 "Jeff Schroeder" <[EMAIL PROTECTED]> wrote: > > NFS is a good example of a fs that never got redesigned for the modern world. I > > hope it will, but currently it's like a Model T on a highway. > > You have an NFS server with clients. Your NFS server dies, your backup

Re: Some very basic questions

2008-10-22 Thread Chris Mason
On Wed, 2008-10-22 at 09:38 -0400, Ric Wheeler wrote: > Chris Mason wrote: > > On Wed, 2008-10-22 at 22:15 +0900, Tejun Heo wrote: > > > >> Ric Wheeler wrote: > >> > >>> I think that we do handle a failure in the case that you outline above > >>> since the FS will be able to notice the erro

Re: BTRFS Performance page

2008-10-22 Thread Steven Pratt
Chris Mason wrote: On Tue, Oct 21, 2008 at 05:20:03PM -0500, Steven Pratt wrote: As discussed on the BTRFS conference call, myself and Kevin Corry have set up some test machines for the purpose of doing performance testing on BTRFS. The intent is to have a semi permanent setup that we can

Re: Some very basic questions

2008-10-22 Thread Stephan von Krawczynski
On Wed, 22 Oct 2008 09:15:45 -0400 Chris Mason <[EMAIL PROTECTED]> wrote: > On Wed, 2008-10-22 at 14:27 +0200, Stephan von Krawczynski wrote: > > On Tue, 21 Oct 2008 13:31:37 -0400 > > Ric Wheeler <[EMAIL PROTECTED]> wrote: > > > > > [...] > > > If you have remapped a big chunk of the sectors (sa

Re: Some very basic questions

2008-10-22 Thread Chris Mason
On Wed, 2008-10-22 at 14:19 +0200, Stephan von Krawczynski wrote: > On Tue, 21 Oct 2008 13:49:43 -0400 > Chris Mason <[EMAIL PROTECTED]> wrote: > > > On Tue, 2008-10-21 at 18:27 +0200, Stephan von Krawczynski wrote: > > > > > > > 2. general requirements > > > > > - fs errors without file/dir

Re: Some very basic questions

2008-10-22 Thread Ric Wheeler
Chris Mason wrote: On Wed, 2008-10-22 at 22:15 +0900, Tejun Heo wrote: Ric Wheeler wrote: I think that we do handle a failure in the case that you outline above since the FS will be able to notice the error before it sends a commit down (and that commit is wrapped in the barrier flush c

Re: Some very basic questions

2008-10-22 Thread Ric Wheeler
Chris Mason wrote: On Wed, 2008-10-22 at 14:27 +0200, Stephan von Krawczynski wrote: On Tue, 21 Oct 2008 13:31:37 -0400 Ric Wheeler <[EMAIL PROTECTED]> wrote: [...] If you have remapped a big chunk of the sectors (say more than 10%), you should grab the data off the disk asap and repl

Re: Some very basic questions

2008-10-22 Thread Ric Wheeler
Tejun Heo wrote: Ric Wheeler wrote: I think that we do handle a failure in the case that you outline above since the FS will be able to notice the error before it sends a commit down (and that commit is wrapped in the barrier flush calls). This is the easy case since we still have the context

Re: Some very basic questions

2008-10-22 Thread Chris Mason
On Wed, 2008-10-22 at 22:15 +0900, Tejun Heo wrote: > Ric Wheeler wrote: > > I think that we do handle a failure in the case that you outline above > > since the FS will be able to notice the error before it sends a commit > > down (and that commit is wrapped in the barrier flush calls). This is >

Re: Some very basic questions

2008-10-22 Thread Avi Kivity
Ric Wheeler wrote: Scrubbing is key for many scenarios since errors can "grow" even in places where previous IO has been completed without flagging an error. Some neat tricks are: (1) use block level scrubbing to detect any media errors. If you can map that sector level error into a file s

Re: Some very basic questions

2008-10-22 Thread Tejun Heo
Ric Wheeler wrote: > I think that we do handle a failure in the case that you outline above > since the FS will be able to notice the error before it sends a commit > down (and that commit is wrapped in the barrier flush calls). This is > the easy case since we still have the context for the IO. I

Re: Some very basic questions

2008-10-22 Thread Chris Mason
On Wed, 2008-10-22 at 14:27 +0200, Stephan von Krawczynski wrote: > On Tue, 21 Oct 2008 13:31:37 -0400 > Ric Wheeler <[EMAIL PROTECTED]> wrote: > > > [...] > > If you have remapped a big chunk of the sectors (say more than 10%), you > > should grab the data off the disk asap and replace it. Worry

Re: Some very basic questions

2008-10-22 Thread Chris Mason
On Wed, 2008-10-22 at 09:03 -0400, Ric Wheeler wrote: > Avi Kivity wrote: > > Stephan von Krawczynski wrote: > >> > >>>- filesystem autodetects, isolates, and (possibly) repairs errors > >>>- online "scan, check, repair filesystem" tool initiated by admin > >>>- Reliability so high that

Re: Some very basic questions

2008-10-22 Thread Ric Wheeler
Avi Kivity wrote: Stephan von Krawczynski wrote: - filesystem autodetects, isolates, and (possibly) repairs errors - online "scan, check, repair filesystem" tool initiated by admin - Reliability so high that they never run that check-and-fix tool That is _wrong_ (to a certain e

Re: Some very basic questions

2008-10-22 Thread Ric Wheeler
Tejun Heo wrote: Ric Wheeler wrote: The cache flush command for ATA devices will block and wait until all of the device's write cache has been written back. What I assume Tejun was referring to here is that some IO might have been written out to the device and an error happened when the devi

[PATCH] nuke fs wide allocation mutex V2

2008-10-22 Thread Josef Bacik
Hello, This patch removes the giant fs_info->alloc_mutex and replaces it with a bunch of little locks. There is now a pinned_mutex, which is used when messing with the pinned_extents extent io tree, and the extent_ins_mutex which is used with the pending_del and extent_ins extent io trees. The l

Re: Some very basic questions

2008-10-22 Thread Jeff Schroeder
On Wed, Oct 22, 2008 at 5:19 AM, Stephan von Krawczynski <[EMAIL PROTECTED]> wrote: > On Tue, 21 Oct 2008 13:49:43 -0400 > Chris Mason <[EMAIL PROTECTED]> wrote: > >> On Tue, 2008-10-21 at 18:27 +0200, Stephan von Krawczynski wrote: >> >> > > > 2. general requirements >> > > > - fs errors witho

Re: Some very basic questions

2008-10-22 Thread Stephan von Krawczynski
On Tue, 21 Oct 2008 13:31:37 -0400 Ric Wheeler <[EMAIL PROTECTED]> wrote: > [...] > If you have remapped a big chunk of the sectors (say more than 10%), you > should grab the data off the disk asap and replace it. Worry less about > errors during read, writes indicate more serious errors. Ok, n

Re: Some very basic questions

2008-10-22 Thread Stephan von Krawczynski
On Tue, 21 Oct 2008 13:49:43 -0400 Chris Mason <[EMAIL PROTECTED]> wrote: > On Tue, 2008-10-21 at 18:27 +0200, Stephan von Krawczynski wrote: > > > > > 2. general requirements > > > > - fs errors without file/dir names are useless > > > > - errors in parts of the fs are no reason for a fs

Re: Some very basic questions

2008-10-22 Thread Avi Kivity
Stephan von Krawczynski wrote: - filesystem autodetects, isolates, and (possibly) repairs errors - online "scan, check, repair filesystem" tool initiated by admin - Reliability so high that they never run that check-and-fix tool That is _wrong_ (to a certain extent). You _want t

Re: Some very basic questions

2008-10-22 Thread Stephan von Krawczynski
On Tue, 21 Oct 2008 18:59:26 +0200 Andi Kleen <[EMAIL PROTECTED]> wrote: > Stephan von Krawczynski <[EMAIL PROTECTED]> writes: > > > > Yes, we hear and say that all the time, name one linux fs doing it, please. > > ext[234] support it to some extent. It has some limitations > (especially when the

Re: Some very basic questions

2008-10-22 Thread Stephan von Krawczynski
On Tue, 21 Oct 2008 18:09:40 +0200 Andi Kleen <[EMAIL PROTECTED]> wrote: > While that's true today, I'm not sure it has to be true always. > I always thought traditional fsck user interfaces were a > UI disaster and could be done much better with some simple tweaks. > [...] You are completely ri

Re: Some very basic questions

2008-10-22 Thread Stephan von Krawczynski
On Tue, 21 Oct 2008 13:15:13 -0400 Christoph Hellwig <[EMAIL PROTECTED]> wrote: > On Tue, Oct 21, 2008 at 07:01:36PM +0200, Stephan von Krawczynski wrote: > > Sure, but what you say only reflects the ideal world. On a file service, you > > never have that. In fact you do not even have good control

Re: Some very basic questions

2008-10-22 Thread Stephan von Krawczynski
On Tue, 21 Oct 2008 11:34:20 -0400 jim owens <[EMAIL PROTECTED]> wrote: > Hearing what users think they want is always good, but... > > Stephan von Krawczynski wrote: > > > > thanks for your feedback. Understand "minimum requirement" as "minimum > > requirement to drop the current installation

Re: Some very basic questions

2008-10-22 Thread Tejun Heo
Ric Wheeler wrote: > The cache flush command for ATA devices will block and wait until all of > the device's write cache has been written back. > > What I assume Tejun was referring to here is that some IO might have > been written out to the device and an error happened when the device > tried to

Re: Some very basic questions

2008-10-22 Thread Ric Wheeler
Eric Anopolsky wrote: On Tue, 2008-10-21 at 18:18 -0400, Ric Wheeler wrote: Eric Anopolsky wrote: On Tue, 2008-10-21 at 09:59 -0400, Chris Mason wrote: - power loss at any time must not corrupt the fs (atomic fs modification) (new-data loss is acceptable)

Re: Some very basic questions

2008-10-22 Thread Avi Kivity
jim owens wrote: Remember that the device bandwidth is the limiter so even when each host has a dedicated path to the device (as in dual port SAS or FC), that 2nd host cuts the throughput by more than 1/2 with uncoordinated seeks and transfers. That's only a problem if there is a single shared