Re: RAID device nomination (Feature request)

2013-04-18 Thread Alex Elsayed
Martin wrote:



> Or perhaps include the same Ceph code routines into btrfs?...

That's what I was thinking. The CRUSH code is actually already pretty
well factored out - it lives in net/ceph/crush/ in the kernel source
tree and is treated as part of 'libceph' (which is used by both the
Ceph filesystem in fs/ceph/ and the RBD block device in
drivers/block/rbd.c).



Re: RAID device nomination (Feature request)

2013-04-18 Thread Martin
On 18/04/13 20:48, Alex Elsayed wrote:
> Hugo Mills wrote:
> 
>> On Thu, Apr 18, 2013 at 02:45:24PM +0100, Martin wrote:
>>> Dear Devs,
> 
>>> Note that esata shows just the disks as individual physical disks, 4 per
>>> disk pack. Can physical disks be grouped together to force the RAID data
>>> to be mirrored across all the nominated groups?
>>
>> Interesting you should ask this: I realised quite recently that
>> this could probably be done fairly easily with a modification to the
>> chunk allocator.
> 
> 
> One thing that might be an interesting approach:
> 
> Ceph is already in mainline, and uses CRUSH in a similar way to what's 
> described (topology-aware placement+replication). Ceph does it by OSD nodes 
> rather than disk, and the units are objects rather than chunks, but it could 
> potentially be a rather good fit.
> 
> CRUSH does it by describing a topology hierarchy, and allocating the OSD ids 
> to that hierarchy. It then uses that to map from a key to one-or-more 
> locations. If we use chunk ID as the key, and use UUID_SUB in place of the 
> OSD id, it could do the job.

OK... That was a bit of a crash course (ok, sorry for the pun on crush :-) )

http://www.anchor.com.au/blog/2012/09/a-crash-course-in-ceph/


Interesting that the "CRUSH map is written by hand, then compiled and
passed to the cluster".

Hence, it looks like the simplest approach is to have the sysadmin
specify what gets grouped into which group. (I certainly know which
disk is where and where I want the data mirrored!)
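
From what I can tell from that article, a hand-written CRUSH map is
just plain text that gets compiled with crushtool. Very roughly --
trimmed to two disks per pack, and with the bucket/rule names below
invented for the example rather than taken from any real cluster --
modelling two disk packs and keeping each replica in a different pack
looks something like this:

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3

# types
type 0 osd
type 1 pack
type 2 root

# buckets -- one bucket per eSATA pack
pack pack-a {
        id -2
        alg straw
        hash 0          # rjenkins1
        item osd.0 weight 1.000
        item osd.1 weight 1.000
}
pack pack-b {
        id -3
        alg straw
        hash 0
        item osd.2 weight 1.000
        item osd.3 weight 1.000
}
root default {
        id -1
        alg straw
        hash 0
        item pack-a weight 2.000
        item pack-b weight 2.000
}

# rules -- keep each replica in a different pack
rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type pack
        step emit
}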


For my example, the disk packs are plugged into two servers (up to four
at a time at present) so that we have some fail-over if one server dies.
Ceph looks to be a little overkill for just two big storage users.

Or perhaps include the same Ceph code routines into btrfs?...


Regards,
Martin



Re: RAID device nomination (Feature request)

2013-04-18 Thread Martin
On 18/04/13 20:44, Hugo Mills wrote:
> On Thu, Apr 18, 2013 at 05:29:10PM +0100, Martin wrote:
>> On 18/04/13 15:06, Hugo Mills wrote:
>>> On Thu, Apr 18, 2013 at 02:45:24PM +0100, Martin wrote:
>>>> Dear Devs,
>>>>
>>>> I have a number of esata disk packs holding 4 physical disks
>>>> each where I wish to use the disk packs aggregated for 16TB
>>>> and up to 64TB backups...
>>>>
>>>> Can btrfs...?
>>>>
>>>> 1:
>>>>
>>>> Mirror data such that there is a copy of data on each *disk
>>>> pack* ?
>>>>
>>>> Note that esata shows just the disks as individual physical
>>>> disks, 4 per disk pack. Can physical disks be grouped
>>>> together to force the RAID data to be mirrored across all the
>>>> nominated groups?
>>> 
>>> Interesting you should ask this: I realised quite recently that
>>>  this could probably be done fairly easily with a modification
>>> to the chunk allocator.
>> 
>> Hey, that sounds good. And easy? ;-)
>> 
>> Possible?...
> 
> We'll see... I'm a bit busy for the next week or so, but I'll see 
> what I can do.

Thanks greatly. That should nicely let me stay with my "plan A" and
just let btrfs conveniently expand over multiple disk packs :-)

(I'm playing 'safe' for the moment while I can by putting bigger
disks into new packs as needed. I've some packs with smaller disks
that are nearly full which I want to continue to use, so I'm
agonising over whether to replace all the disks and rewrite all the
data, or use multiple disk packs as one. "Plan A" is good for
keeping the existing disks :-) )


[...]
>> The question is how the groups of disks are determined:
>> 
>> Manually by the user for mkfs.btrfs and/or specified when disks
>> are added/replaced;
>> 
>> Or somehow automatically detected (but with a user override).
>> 
>> 
>> Have a "disk group" UUID for a group of disks similar to that
>> done for md-raid?
> 
> I was planning on simply having userspace assign a (small) integer 
> to each device. Devices with the same integer are in the same
> group, and won't have more than one copy of any given piece of data
> assigned to them. Note that there's already an unused "disk group"
> item which is a 32-bit integer in the device structure, which looks
> like it can be repurposed for this; there's no spare space in the
> device structure, so anything more than that will involve some kind
> of disk format change.

The "repurpose for no format change" sounds very good and 32-bits
should be enough for anyone. (Notwithstanding the inevitable 640k
comments!)

A 32-bit unsigned int that the user specifies? Or should a
semi-random number be assigned automatically to a group of devices
listed by the user?...

Then again, I can't imagine anyone wanting to go beyond 8 bits...
Hence a 16-bit unsigned int is still suitably overkill, which leaves
the other 16 bits free for some other repurposing ;-)


For myself, it would be nice to be able to specify a number that is
the same unique number that's stamped on the disk packs so that I can
be sure what has been plugged in! (Assuming there's some option to
list what's been plugged in.)


>>>> 3:
>>>>
>>>> Also, for different speeds of disks, can btrfs tune itself
>>>> to balance the read/writes accordingly?
>>> 
>>> Not that I'm aware of.
>> 
>> A 'nice to have' would be some sort of read-access load balancing
>> with options to balance latency or queue depth... Could btrfs do
>> that independently of (but complementary with) the block layer
>> schedulers?
> 
> All things are possible... :) Whether it's something that someone 
> will actually do or not, I don't know. There's an argument for
> getting some policy into that allocation decision for other
> purposes (e.g. trying to ensure that if a disk dies from a
> filesystem with "single" allocation, you lose the fewest number of
> files).
> 
> On the other hand, this is probably going to be one of those
> things that could have really nasty performance effects. It's also
> somewhat beyond my knowledge right now, so someone else will have
> to look at it. :)

Sounds ideal for some university research ;-)


[...]
>> For example, on an SSD the next free-space allocation for whatever
>> is to be newly written could become more like a log-based
>> round-robin allocation across the entire SSD (NILFS-like?), rather
>> than trying to localise data to minimise physical head movement as
>> on an HDD.
>> 
>> Or is there no useful gain with that over simply using the same
>> one lump of allocator code as for HDDs?
> 
> No idea. It's going to need someone to write the code and
> benchmark the options, I suspect.

A second university project? ;-)


[...]
>> (And there's always the worry of the esata lead getting yanked to
>> take out all four disks...)
> 
> As I said, I've done the latter myself. The array *should* go into
> read-only mode, preventing any damage. [...]

Looks like I'll likely get to find out for myself sometime or other...



Thanks for your help, and please keep me posted.

I'll be experimenting with the groupings as soon as they come along.
Also for the 

Re: RAID device nomination (Feature request)

2013-04-18 Thread Alex Elsayed
Hugo Mills wrote:

> On Thu, Apr 18, 2013 at 02:45:24PM +0100, Martin wrote:
>> Dear Devs,

>> Note that esata shows just the disks as individual physical disks, 4 per
>> disk pack. Can physical disks be grouped together to force the RAID data
>> to be mirrored across all the nominated groups?
> 
> Interesting you should ask this: I realised quite recently that
> this could probably be done fairly easily with a modification to the
> chunk allocator.


One thing that might be an interesting approach:

Ceph is already in mainline, and uses CRUSH in a similar way to what's 
described (topology-aware placement+replication). Ceph does it by OSD nodes 
rather than disk, and the units are objects rather than chunks, but it could 
potentially be a rather good fit.

CRUSH does it by describing a topology hierarchy, and allocating the OSD ids 
to that hierarchy. It then uses that to map from a key to one-or-more 
locations. If we use chunk ID as the key, and use UUID_SUB in place of the 
OSD id, it could do the job.
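
To illustrate the shape of the idea only -- this is not how libceph's
CRUSH implementation actually works (it has its own bucket types and
rule engine), and every name below is made up for the sketch -- here
is a toy user-space program that deterministically maps a chunk ID to
one device per disk pack using rendezvous-style (highest-random-weight)
hashing:

/* Toy stand-in for CRUSH-style placement: deterministically map a
 * chunk ID to one device per disk pack using rendezvous hashing.
 * Real CRUSH uses its own bucket algorithms (straw etc.); this only
 * illustrates the key -> one-or-more-locations idea described above. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* FNV-1a, used here only as a convenient deterministic hash. */
static uint64_t hash64(const void *data, size_t len, uint64_t seed)
{
    const unsigned char *p = data;
    uint64_t h = 1469598103934665603ULL ^ seed;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 1099511628211ULL;
    }
    return h;
}

struct dev { const char *uuid_sub; int pack; };

/* For a given pack, pick the device whose (chunk, device) score is
 * highest; the choice is stable as long as the pack's membership is. */
static const struct dev *place_in_pack(uint64_t chunk_id,
                                       const struct dev *devs, int ndevs,
                                       int pack)
{
    const struct dev *best = NULL;
    uint64_t best_score = 0;

    for (int i = 0; i < ndevs; i++) {
        if (devs[i].pack != pack)
            continue;
        uint64_t score = hash64(devs[i].uuid_sub,
                                strlen(devs[i].uuid_sub), chunk_id);
        if (!best || score > best_score) {
            best = &devs[i];
            best_score = score;
        }
    }
    return best;
}

int main(void)
{
    /* Two disk packs, four (abbreviated, made-up) UUID_SUBs each. */
    const struct dev devs[] = {
        { "aaaa-1", 1 }, { "aaaa-2", 1 }, { "aaaa-3", 1 }, { "aaaa-4", 1 },
        { "bbbb-1", 2 }, { "bbbb-2", 2 }, { "bbbb-3", 2 }, { "bbbb-4", 2 },
    };

    /* One replica per pack for a handful of chunk IDs. */
    for (uint64_t chunk = 0; chunk < 4; chunk++)
        printf("chunk %llu -> pack1:%s pack2:%s\n",
               (unsigned long long)chunk,
               place_in_pack(chunk, devs, 8, 1)->uuid_sub,
               place_in_pack(chunk, devs, 8, 2)->uuid_sub);
    return 0;
}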



Re: RAID device nomination (Feature request)

2013-04-18 Thread Hugo Mills
On Thu, Apr 18, 2013 at 05:29:10PM +0100, Martin wrote:
> On 18/04/13 15:06, Hugo Mills wrote:
> > On Thu, Apr 18, 2013 at 02:45:24PM +0100, Martin wrote:
> >> Dear Devs,
> >> 
> >> I have a number of esata disk packs holding 4 physical disks each
> >> where I wish to use the disk packs aggregated for 16TB and up to
> >> 64TB backups...
> >> 
> >> Can btrfs...?
> >> 
> >> 1:
> >> 
> >> Mirror data such that there is a copy of data on each *disk pack*
> >> ?
> >> 
> >> Note that esata shows just the disks as individual physical
> >> disks, 4 per disk pack. Can physical disks be grouped together to
> >> force the RAID data to be mirrored across all the nominated
> >> groups?
> > 
> > Interesting you should ask this: I realised quite recently that 
> > this could probably be done fairly easily with a modification to
> > the chunk allocator.
> 
> Hey, that sounds good. And easy? ;-)
> 
> Possible?...

   We'll see... I'm a bit busy for the next week or so, but I'll see
what I can do.

> >> 2:
> >> 
> >> Similarly for a mix of different storage technologies such as 
> >> manufacturer or type (SSD/HDD), can the disks be grouped to
> >> ensure a copy of the data is replicated across all the groups?
> >> 
> >> For example, I deliberately buy HDDs from different 
> >> batches/manufacturers to try to avoid common mode or similarly
> >> timed failures. Can btrfs be guided to safely spread the RAID
> >> data across the *different* hardware types/batches?
> > 
> > From the kernel point of view, this is the same question as the 
> > previous one.
> 
> Indeed so.
> 
> The question is how the groups of disks are determined:
> 
> Manually by the user for mkfs.btrfs and/or specified when disks are
> added/replaced;
> 
> Or somehow automatically detected (but with a user override).
> 
> 
> Have a "disk group" UUID for a group of disks similar to that done for
> md-raid?

   I was planning on simply having userspace assign a (small) integer
to each device. Devices with the same integer are in the same group,
and won't have more than one copy of any given piece of data assigned
to them. Note that there's already an unused "disk group" item which
is a 32-bit integer in the device structure, which looks like it can
be repurposed for this; there's no spare space in the device
structure, so anything more than that will involve some kind of disk
format change.
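
   As a rough user-space illustration of that constraint (the types
and function names below are invented for the sketch, not actual btrfs
code): pick the devices for each chunk greedily by free space, skipping
any device whose group id has already supplied a copy of that chunk.

/* Hypothetical illustration of group-aware device selection for one
 * chunk.  Not btrfs code: the structures and names are made up. */
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

struct candidate_dev {
    const char *name;
    uint32_t group;       /* the repurposed 32-bit "disk group" value */
    uint64_t free_bytes;  /* unallocated space left on the device */
};

/* Pick up to num_copies devices (num_copies <= 16 here), never reusing
 * a group id, preferring devices with the most free space (greedy). */
static int pick_mirror_devices(const struct candidate_dev *devs, int ndevs,
                               int num_copies, int *chosen)
{
    uint32_t used_groups[16];
    int nchosen = 0;

    while (nchosen < num_copies) {
        int best = -1;
        for (int i = 0; i < ndevs; i++) {
            bool group_used = false;
            for (int g = 0; g < nchosen; g++)
                if (devs[i].group == used_groups[g])
                    group_used = true;
            if (group_used)
                continue;
            if (best < 0 || devs[i].free_bytes > devs[best].free_bytes)
                best = i;
        }
        if (best < 0)
            return -1;  /* not enough distinct groups for this profile */
        used_groups[nchosen] = devs[best].group;
        chosen[nchosen++] = best;
    }
    return nchosen;
}

int main(void)
{
    /* Two eSATA packs, four disks each; group = pack number. */
    const struct candidate_dev devs[] = {
        { "sdb", 1, 900 }, { "sdc", 1, 850 }, { "sdd", 1, 800 }, { "sde", 1, 750 },
        { "sdf", 2, 910 }, { "sdg", 2, 860 }, { "sdh", 2, 810 }, { "sdi", 2, 760 },
    };
    int chosen[2];

    if (pick_mirror_devices(devs, 8, 2, chosen) == 2)
        printf("RAID-1 chunk -> %s (group %u) and %s (group %u)\n",
               devs[chosen[0]].name, (unsigned)devs[chosen[0]].group,
               devs[chosen[1]].name, (unsigned)devs[chosen[1]].group);
    return 0;
}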

> >> 3:
> >> 
> >> Also, for different speeds of disks, can btrfs tune itself to
> >> balance the read/writes accordingly?
> > 
> > Not that I'm aware of.
> 
> A 'nice to have' would be some sort of read-access load balancing with
> options to balance latency or queue depth... Could btrfs do that
> independently of (but complementary with) the block layer schedulers?

   All things are possible... :) Whether it's something that someone
will actually do or not, I don't know. There's an argument for getting
some policy into that allocation decision for other purposes (e.g.
trying to ensure that if a disk dies from a filesystem with "single"
allocation, you lose the fewest number of files).

   On the other hand, this is probably going to be one of those things
that could have really nasty performance effects. It's also somewhat
beyond my knowledge right now, so someone else will have to look at
it. :)

> >> 4:
> >> 
> >> Further thought: For SSDs, is the "minimise heads movement"
> >> 'staircase' code bypassed so as to speed up allocation for the
> >> "don't care" addressing (near zero seek time) of SSDs?
> > 
> > I think this is more to do with the behaviour of the block layer 
> > than the FS. There are alternative elevators that can be used, but
> > I don't know how to configure them (or whether they need
> > configuring at all).
> 
> Regardless of the block level io schedulers, does not btrfs determine
> the LBA allocation?...
> 
> For example, on an SSD the next free-space allocation for whatever
> is to be newly written could become more like a log-based round-robin
> allocation across the entire SSD (NILFS-like?), rather than trying to
> localise data to minimise physical head movement as on an HDD.
> 
> Or is there no useful gain with that over simply using the same one
> lump of allocator code as for HDDs?

   No idea. It's going to need someone to write the code and benchmark
the options, I suspect.

> > On the other hand, it's entirely possible that something else will 
> > go wrong and things will blow up. My guess is that unless you have
> [...]
> 
> My worry for moving up to spreading a filesystem across multiple disk
> packs is for when the disk pack hardware itself fails taking out all
> four disks...
> 
> (And there's always the worry of the esata lead getting yanked to take
> out all four disks...)

   As I said, I've done the latter myself. The array *should* go into
read-only mode, preventing any damage. You will then need to unmount
the whole thing, ensure that the original disks are all back in place
and detected again, and remount the FS. In theory

Re: RAID device nomination (Feature request)

2013-04-18 Thread Martin
On 18/04/13 15:06, Hugo Mills wrote:
> On Thu, Apr 18, 2013 at 02:45:24PM +0100, Martin wrote:
>> Dear Devs,
>> 
>> I have a number of esata disk packs holding 4 physical disks each
>> where I wish to use the disk packs aggregated for 16TB and up to
>> 64TB backups...
>> 
>> Can btrfs...?
>> 
>> 1:
>> 
>> Mirror data such that there is a copy of data on each *disk pack*
>> ?
>> 
>> Note that esata shows just the disks as individual physical
>> disks, 4 per disk pack. Can physical disks be grouped together to
>> force the RAID data to be mirrored across all the nominated
>> groups?
> 
> Interesting you should ask this: I realised quite recently that 
> this could probably be done fairly easily with a modification to
> the chunk allocator.

Hey, that sounds good. And easy? ;-)

Possible?...


>> 2:
>> 
>> Similarly for a mix of different storage technologies such as 
>> manufacturer or type (SSD/HDD), can the disks be grouped to
>> ensure a copy of the data is replicated across all the groups?
>> 
>> For example, I deliberately buy HDDs from different 
>> batches/manufacturers to try to avoid common mode or similarly
>> timed failures. Can btrfs be guided to safely spread the RAID
>> data across the *different* hardware types/batches?
> 
> From the kernel point of view, this is the same question as the 
> previous one.

Indeed so.

The question is how the groups of disks are determined:

Manually by the user for mkfs.btrfs and/or specified when disks are
added/replaced;

Or somehow automatically detected (but with a user override).


Have a "disk group" UUID for a group of disks similar to that done for
md-raid?



>> 3:
>> 
>> Also, for different speeds of disks, can btrfs tune itself to
>> balance the read/writes accordingly?
> 
> Not that I'm aware of.

A 'nice to have' would be some sort of read-access load balancing with
options to balance latency or queue depth... Could btrfs do that
independently of (but complementary with) the block layer schedulers?


>> 4:
>> 
>> Further thought: For SSDs, is the "minimise heads movement"
>> 'staircase' code bypassed so as to speed up allocation for the
>> "don't care" addressing (near zero seek time) of SSDs?
> 
> I think this is more to do with the behaviour of the block layer 
> than the FS. There are alternative elevators that can be used, but
> I don't know how to configure them (or whether they need
> configuring at all).

Regardless of the block level io schedulers, does not btrfs determine
the LBA allocation?...

For example, on an SSD the next free-space allocation for whatever is
to be newly written could become more like a log-based round-robin
allocation across the entire SSD (NILFS-like?), rather than trying to
localise data to minimise physical head movement as on an HDD.

Or is there no useful gain with that over simply using the same one
lump of allocator code as for HDDs?


> You have backups, which is good. Keep up with the latest kernels 
> from kernel.org. The odds of you hitting something major are
> small, but non-zero. One thing that's probably fairly likely with
> your setup

Healthy paranoia is good ;-)


[...]
> So with light home use on a largeish array, I've had a number of 
> cockups recently that were recoverable, albeit with some swearing.

Thanks for the notes.


> On the other hand, it's entirely possible that something else will 
> go wrong and things will blow up. My guess is that unless you have
[...]

My worry for moving up to spreading a filesystem across multiple disk
packs is for when the disk pack hardware itself fails taking out all
four disks...

(And there's always the worry of the esata lead getting yanked to take
out all four disks...)


Thanks,
Martin



Re: RAID device nomination (Feature request)

2013-04-18 Thread Hugo Mills
On Thu, Apr 18, 2013 at 02:45:24PM +0100, Martin wrote:
> Dear Devs,
> 
> I have a number of esata disk packs holding 4 physical disks each where
> I wish to use the disk packs aggregated for 16TB and up to 64TB backups...
> 
> Can btrfs...?
> 
> 1:
> 
> Mirror data such that there is a copy of data on each *disk pack* ?
> 
> Note that esata shows just the disks as individual physical disks, 4 per
> disk pack. Can physical disks be grouped together to force the RAID data
> to be mirrored across all the nominated groups?

   Interesting you should ask this: I realised quite recently that
this could probably be done fairly easily with a modification to the
chunk allocator.

> 2:
> 
> Similarly for a mix of different storage technologies such as
> manufacturer or type (SSD/HDD), can the disks be grouped to ensure a
> copy of the data is replicated across all the groups?
> 
> For example, I deliberately buy HDDs from different
> batches/manufacturers to try to avoid common mode or similarly timed
> failures. Can btrfs be guided to safely spread the RAID data across the
> *different* hardware types/batches?

   From the kernel point of view, this is the same question as the
previous one.

> 3:
> 
> Also, for different speeds of disks, can btrfs tune itself to balance
> the read/writes accordingly?

   Not that I'm aware of.

> 4:
> 
> Further thought: For SSDs, is the "minimise heads movement" 'staircase'
> code bypassed so as to speed up allocation for the "don't care"
> addressing (near zero seek time) of SSDs?

   I think this is more to do with the behaviour of the block layer
than the FS. There are alternative elevators that can be used, but I
don't know how to configure them (or whether they need configuring at
all).

> And then again: Is 64TBytes of btrfs a good idea in the first place?!
> 
> (There's more than one physical set of backups but I'd rather not suffer
> weeks to recover from one hiccup in the filesystem... Should I partition
> btrfs down to smaller gulps, or does the structure of btrfs in effect
> already do that?)

   You have backups, which is good. Keep up with the latest kernels
from kernel.org. The odds of you hitting something major are small,
but non-zero. One thing that's probably fairly likely with your setup
is accidental disconnection of a disk or block of disks. Having
duplicate data is really quite handy in that instance -- if you lose
one device and reinsert it, you can recover easily with a scrub (I've
done that). If you lose multiple devices in a block, then the FS will
probably go read-only and stop any further damage from being done
until it can be unmounted and the hardware reassembled (I've done this
too(+), with half of my 10 TB(*) array).

   So with light home use on a largeish array, I've had a number of
cockups recently that were recoverable, albeit with some swearing.

   On the other hand, it's entirely possible that something else will
go wrong and things will blow up. My guess is that unless you have
really dodgy hardware that keeps screwing stuff up, you _probably_
won't have to restore from backup. But it still may happen. It's
really hard to put figures on it, because (a) we don't have figures on
how many people actually use the FS, and (b) we don't have many that
we're aware of working at the >10TB level.

   Hugo.

(+) eSATA connectors look regrettably similar to HDMI connectors in
the half-light under the desk.
(*) 5 TB after RAID-1.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- But somewhere along the line, it seems / That pimp became ---
   cool,  and punk mainstream.   

