Re: [OpenIndiana-discuss] ZFS; what the manuals don't say ...

2012-10-28 Thread Richard Elling
On Oct 28, 2012, at 5:10 AM, Robin Axelsson  
wrote:
> On 2012-10-24 21:58, Timothy Coalson wrote:
>> On Wed, Oct 24, 2012 at 6:17 AM, Robin Axelsson<
>> gu99r...@student.chalmers.se>  wrote:
>>> It would be interesting to know how you convert a raidz2 stripe to say a
>>> raidz3 stripe. Let's say that I'm on a raidz2 pool and want to add an extra
>>> parity drive by converting it to a raidz3 pool.  I'm imagining that would
>>> be like creating a raidz1 pool on top of the leaf vdevs that constitutes
>>> the raidz2 pool and the new leaf vdev which results in an additional parity
>>> drive. It doesn't sound too difficult to do that. Actually, this way you
>>> could even get raidz4 or raidz5 pools. Question is though, how things would
>>> pan out performance wise, I would imagine that a 55 drive raidz25 pool is
>>> really taxing on the CPU.
>>> 
>> Multiple parity is more complicated than that, an additional xor device (a
>> la traditional raid4) would end up with zeros everywhere, and couldn't
>> reconstruct your data from an additional failure.  Look at "computing
>> parity" in http://en.wikipedia.org/wiki/Raid_6#RAID_6 .  While in theory it
>> can extend to more than 3 parity blocks, it is unclear whether more than 3
>> will offer any serious additional benefits (using multiple raidz2 vdevs can
>> give you better IOPS than larger raidz3 vdevs, with little change in raw
>> space efficiency).  There are also combinatorial implications to multiple
>> bit errors in a single data chunk with high parity levels, but that is
>> somewhat unlikely.
> 
> XOR you say? I didn't know that raidz used xor for parity. I thought they 
> used some kind of a Reed-Solomon implementation à la PAR2 on the block level 
> to achieve "RAID like" functionality. It never was stated from what I could 
> read in the documentation that the raid functionality was implemented like 
> traditional hardware RAID. If xor is the case then I'm curious as to how they 
> managed to pull off a raidz3 implementation with three disk redundancy.

The first parity is XOR (also a Reed-Solomon syndrome). The 2nd and 3rd 
parity are other syndromes.
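
A toy illustration of that in plain /bin/sh (made-up byte values, nothing to do
with raidz's real on-disk layout): the first syndrome really is just XOR, which
can rebuild one lost sector but not two; that is why raidz2/raidz3 need further,
independent Reed-Solomon syndromes over GF(2^8) rather than more plain XOR devices.

#!/bin/sh
# three "data sectors", single bytes for illustration
D0=0x3c D1=0xa5 D2=0x0f
# the first parity (raidz1's P) is the plain XOR of the data
P=$(( D0 ^ D1 ^ D2 ))
# lose D1: XOR of the parity with the survivors recovers it
D1_rebuilt=$(( P ^ D0 ^ D2 ))
printf 'D1 was %#x, rebuilt as %#x\n' $(( D1 )) $D1_rebuilt
# lose D1 and D2: P ^ D0 only yields D1 ^ D2 -- one equation, two
# unknowns -- so a second, independent syndrome (raidz2's Q) is needed
echo $(( P ^ D0 ))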

Also, minor nit: there is no such thing as hardware RAID, there is only 
software RAID.
 -- richard

> 
> Maybe a good read into the zpool source code would help clarify things...
> 
>> 
>> Going from raidz3 to raidz2 or from raidz2 to raidz1 sounds like a
>>> no-brainer; you just remove one drive from the pool and force zpool to
>>> accept the new state as "normal".
>>> 
>> A degraded raidz2 vdev has to compute the missing block from parity on
>> nearly every read; that is not the normal state of raidz1.  Changing the
>> parity level, either up or down, has similar complications in the on-disk
>> structure.
>> 
>> But expanding a raidz pool with additional storage while preserving the
>>> parity structure sounds a little bit trickier. I don't think I have that
>>> knowledge to write a bpr rewriter although I'm reading Solaris Internals
>>> right now ;)
>> 
>> Unless raidz* did something radically different than raid5/6 (as in, not
>> having the parity blocks necessarily next to each other in the data chunk,
>> and having their positions recorded in the data chunk itself), the position
>> of the parity and data blocks would change.  The "always consistent on
>> disk" approach of ZFS adds additional problems to this, which probably make
>> it impossible to rewrite the re-parity'ed chunk over the old chunk, meaning
>> it has to find some free space every time it wants to update a chunk to the
>> new parity level.
>> 
>> 
 What you describe here is known as unionfs in Linux, among others.
 I think there were RFEs or otherwise expressed desires to make that
 in Solaris and later illumos (I did campaign for that sometime ago),
 but AFAIK this was not yet done by anyone.
 
  YES, UnionFS-like functionality is what I was talking about. It seems
>>> like it has been abandoned in favor of AuFS in the Linux and the BSD world.
>>> It seems to have functions that are a little overkill to use with zfs, such
>>> as copy-on-write. Perhaps a more simplistic implementation of it would be
>>> more suitable for zfs.
>>> 
>> You could create zfs filesystems for subfolders in your "dataset" from the
>> separate pools, and give them mountpoints that put them into the same
>> directory.  You would have to balance the data allocation between the pools
>> manually, though.
> 
> I know that works but I was talking about having files stored at different 
> (hardware) locations and yet being in the same ... folder, I guess you are 
> using MacOS :)
> 
>> 
>> Perhaps a similar functionality can be established through an abstraction
>>> layer behind network shares.
>>> 
>>> In Windows this functionality is called 'disk pooling', btw.
>> 
>> In ZFS, disk pooling is done by "creating a zpool", emphasis on singular.
>>  Do you actually expect a large portion of your disks to go offline
>> suddenly?  I don't see a g

Re: [OpenIndiana-discuss] ZFS; what the manuals don't say ...

2012-10-28 Thread Jan Owoc
On Sun, Oct 28, 2012 at 6:42 AM, Robin Axelsson
 wrote:
> As I understand, fragmentation occurs when you remove files and rewrite
> files to a storage pool a large enough number of times. So if I have a
> fragmented storage pool and copy all files to a new empty storage pool, the
> new storage pool would not be fragmented. So I take it what you are saying
> is that the way the files are organized in the new storage pool is sometimes
> *worse* than the way they are organized in the old fragmented storage pool?

Robin, I think you missed this part by Jim:

On Thu, Oct 25, 2012 at 4:21 AM, Jim Klimov  wrote:
> This was discussed several times, with the outcome being that with
> ZFS's data allocation policies, there is no one good defrag policy.
> The two most popular options are about storing the current "live"
> copy of a file contiguously (as opposed to its history of released
> blocks only referenced in snapshots) vs. storing pool blocks in
> ascending creation-TXG order (to arguably speed up scrubs and
> resilvers, which can consume a noticeable portion of performance
> doing random IO).
>
> Usual users mostly think that the first goal is good - however,
> if you add clones and dedup into the equation, it might never be
> possible to retain their benefits AND store all files contiguously.

There are two important points:
1) if 'dedup=on' or you have clones, there is no way to defragment the
files unless you turn off dedup, and make two entirely separate
filesystems for your clones...
2) as soon as your newly-created filesystem has a snapshot and you
change some data, there are at least two 'right' answers for what
should be 'contiguous': the original snapshot, or the current 'live'
filesystem, with good arguments on both sides. You seem to feel the
latter, while for my usage I feel the former (see... even we can't
agree :-)).

Jan



Re: [OpenIndiana-discuss] ZFS; what the manuals don't say ...

2012-10-28 Thread Robin Axelsson

On 2012-10-25 12:21, Jim Klimov wrote:

2012-10-24 15:17, Robin Axelsson wrote:

On 2012-10-23 20:06, Jim Klimov wrote:

2012-10-23 19:53, Robin Axelsson wrote:

...

But if I do send/receive to the same pool I will need to have enough
free space in it to fit at least two copies of the dataset I want to
reallocate.


Likewise with "reallocation" of files - though the unit of required
space would be smaller.



It seems like what zfs is missing here is a good defrag tool.


This was discussed several times, with the outcome being that with
ZFS's data allocation policies, there is no one good defrag policy.
The two most popular options are about storing the current "live"
copy of a file contiguously (as opposed to its history of released
blocks only referenced in snapshots) vs. storing pool blocks in
ascending creation-TXG order (to arguably speed up scrubs and
resilvers, which can consume a noticeable portion of performance
doing random IO).

Usual users mostly think that the first goal is good - however,
if you add clones and dedup into the equation, it might never be
possible to retain their benefits AND store all files contiguously.

Also, as with other matters of moving blocks around in the allocation
areas and transparently to other layers of the system (that is, on a
live system that actively does I/O while you defrag data), there are
some other problems that I'm not very qualified to speculate about,
that are deemed to be solvable by the generic BPR.

Still, I do think that many of the problems postponed until the time
that BPR arrives, can be solved with different methods and limitations
(such as off-line mangling of data on the pool) which might still be
acceptable to some use-cases.

All-in-all, the main intended usage of ZFS is on relatively powerful
enterprise-class machines, where much of the needed data is cached
on SSD or in huge RAM, so random HDD IO lags become less relevant.
This situation is most noticeable with deduplication, which in ZFS
implementation requires vast resources to basically function.
With market prices going down over time, it is more likely to see
home-NAS boxes tomorrow similarly spec'ed to enterprise servers of
today, than to see the core software fundamentally revised and
rearchitected for boxes of yesterday. After all, even in open-source
world, developers need to eat and feed their families, so commercial
applicability does matter and does influence the engineering designs
and trade-offs.




Actually I have refrained from using dedup and compression as I somehow 
feel that the risks and penalties that come with them outweigh the 
benefits, at least for my purposes.


It should also be noted that in a lot of defrag software out there you 
have the option to choose different policies according to which you can 
reorganize the data on a disk. A tool that continuously monitors disk 
activity could give information as to what policy would be optimal for a 
particular disk configuration.


As I understand, fragmentation occurs when you remove files and rewrite 
files to a storage pool a large enough number of times. So if I have a 
fragmented storage pool and copy all files to a new empty storage pool, 
the new storage pool would not be fragmented. So I take it what you are 
saying is that the way the files are organized in the new storage pool 
is sometimes *worse* than the way they are organized in the old 
fragmented storage pool?





It would be interesting to know how you convert a raidz2 stripe to say a
raidz3 stripe. Let's say that I'm on a raidz2 pool and want to add an
extra parity drive by converting it to a raidz3 pool.  I'm imagining
that would be like creating a raidz1 pool on top of the leaf vdevs that
constitutes the raidz2 pool and the new leaf vdev which results in an
additional parity drive. It doesn't sound too difficult to do that.
Actually, this way you could even get raidz4 or raidz5 pools. Question
is though, how things would pan out performance wise, I would imagine
that a 55 drive raidz25 pool is really taxing on the CPU.

Going from raidz3 to raidz2 or from raidz2 to raidz1 sounds like a
no-brainer; you just remove one drive from the pool and force zpool to
accept the new state as "normal".

But expanding a raidz pool with additional storage while preserving the
parity structure sounds a little bit trickier. I don't think I have that
knowledge to write a bpr rewriter although I'm reading Solaris Internals
right now ;)


Read also the ZFS On-Disk Specification (the one I saw is somewhat
outdated, being from 2006, but most concepts and data structures
are the foundation - expected to remain in place and be expanded upon).

In short, if I got that all right, the leaf components of a top-level
VDEV are striped upon creation and declared as an allocation area with
its ID and monotonic offsets of sectors (and further subdivided into
a couple hundred metaslabs to reduce seeking). For example, on a 5-disk
array the offsets of the pooled sectors might look

Re: [OpenIndiana-discuss] ZFS; what the manuals don't say ...

2012-10-28 Thread Robin Axelsson
Yes, this sounds like an interesting solution, but it doesn't seem that 
SAMFS or SAM-QFS is implemented in OI - I could be wrong. The 
documentation for that functionality doesn't seem to be accessible 
anymore; access to docs.sun.com is redirected to a standard page on the 
Oracle website.


On 2012-10-25 10:51, Jim Klimov wrote:

2012-10-24 23:58, Timothy Coalson wrote:
>  I doubt I would like the outcome of having
> some software make arbitrary decisions of what real filesystem each
> to put file on, and then having one filesystem fail, so if you really
> expect this, you may be happier keeping the two pools separate and
> deciding where to put stuff yourself (since if you are expecting a
> set of disks to fail, I expect you would have some idea as to which
> ones it would be, for instance an external enclosure).

This to an extent sounds similar (doable with) hierarchical storage
management, such as Sun's SAMFS/QFS solution. Essentially, this is
a (virtual) filesystem where you set up storage rules based on last
access times and frequencies, data types, etc. and where you have
many tiers of storage (ranging from fast, small, expensive to slow,
bulky, cheap), such as SSD arrays - 15K SAS arrays - 7.2K SATA - tape.

New incoming data ends up on the fast tier. Old stale data lives on
tapes. Data used sometimes migrates between tiers. The rules you
define for the HSM system regulate how many copies on which tier
you'd store, so loss of some devices should not be fatal - as well
as cleaning up space on the faster tier to receive new data or to
cache the old data requested by users and fetched from slower tiers.


I did propose to add some HSM-type capabilities to ZFS, mostly with
the goals of power-saving on home-NAS machines, so that the box could
live with a couple of active disks (i.e. rpool and the "active-data"
part of the data pool) while most of the data pool's disks can remain
spun-down. Whenever a user reads some data from the pool (watching
a movie or listening to music or processing his photos) the system
would prefetch the data (perhaps a folder with MP3's) onto the cache
disks and let the big ones spin down - with a home NAS and few users
it is likely that if you're watching a movie, your system is otherwise
unused for a couple of hours.

Likewise, and this happens to be the trickier part, new writes to the
data pool should go to the active disks and occasionally sync to and
spread over the main pool disk.

I hoped this can be all done transparently to users within ZFS, but
overall discussions led to the conclusion that this can better be done
not within ZFS, but with some daemons (perhaps a dtrace-abusing script)
doing the data migration and abstraction (the transparency to users).
Besides, with introduction and advances in generic L2ARC, and with
the possibility of file-level prefetch, much of that discussion became
moot ;)

Hope this small historical insight helps you :)
//Jim Klimov




Re: [OpenIndiana-discuss] ZFS; what the manuals don't say ...

2012-10-28 Thread Robin Axelsson

On 2012-10-24 21:58, Timothy Coalson wrote:

On Wed, Oct 24, 2012 at 6:17 AM, Robin Axelsson<
gu99r...@student.chalmers.se>  wrote:

It would be interesting to know how you convert a raidz2 stripe to say a
raidz3 stripe. Let's say that I'm on a raidz2 pool and want to add an extra
parity drive by converting it to a raidz3 pool.  I'm imagining that would
be like creating a raidz1 pool on top of the leaf vdevs that constitutes
the raidz2 pool and the new leaf vdev which results in an additional parity
drive. It doesn't sound too difficult to do that. Actually, this way you
could even get raidz4 or raidz5 pools. Question is though, how things would
pan out performance wise, I would imagine that a 55 drive raidz25 pool is
really taxing on the CPU.


Multiple parity is more complicated than that, an additional xor device (a
la traditional raid4) would end up with zeros everywhere, and couldn't
reconstruct your data from an additional failure.  Look at "computing
parity" in http://en.wikipedia.org/wiki/Raid_6#RAID_6 .  While in theory it
can extend to more than 3 parity blocks, it is unclear whether more than 3
will offer any serious additional benefits (using multiple raidz2 vdevs can
give you better IOPS than larger raidz3 vdevs, with little change in raw
space efficiency).  There are also combinatorial implications to multiple
bit errors in a single data chunk with high parity levels, but that is
somewhat unlikely.


XOR you say? I didn't know that raidz used xor for parity. I thought 
they used some kind of a Reed-Solomon implementation à la PAR2 on the 
block level to achieve "RAID like" functionality. It never was stated 
from what I could read in the documentation that the raid functionality 
was implemented like traditional hardware RAID. If xor is the case then 
I'm curious as to how they managed to pull off a raidz3 implementation 
with three disk redundancy.


Maybe a good read into the zpool source code would help clarify things...



Going from raidz3 to raidz2 or from raidz2 to raidz1 sounds like a

no-brainer; you just remove one drive from the pool and force zpool to
accept the new state as "normal".


A degraded raidz2 vdev has to compute the missing block from parity on
nearly every read; that is not the normal state of raidz1.  Changing the
parity level, either up or down, has similar complications in the on-disk
structure.

But expanding a raidz pool with additional storage while preserving the

parity structure sounds a little bit trickier. I don't think I have that
knowledge to write a bpr rewriter although I'm reading Solaris Internals
right now ;)


Unless raidz* did something radically different than raid5/6 (as in, not
having the parity blocks necessarily next to each other in the data chunk,
and having their positions recorded in the data chunk itself), the position
of the parity and data blocks would change.  The "always consistent on
disk" approach of ZFS adds additional problems to this, which probably make
it impossible to rewrite the re-parity'ed chunk over the old chunk, meaning
it has to find some free space every time it wants to update a chunk to the
new parity level.



What you describe here is known as unionfs in Linux, among others.
I think there were RFEs or otherwise expressed desires to make that
in Solaris and later illumos (I did campaign for that sometime ago),
but AFAIK this was not yet done by anyone.

  YES, UnionFS-like functionality is what I was talking about. It seems

like it has been abandoned in favor of AuFS in the Linux and the BSD world.
It seems to have functions that are a little overkill to use with zfs, such
as copy-on-write. Perhaps a more simplistic implementation of it would be
more suitable for zfs.


You could create zfs filesystems for subfolders in your "dataset" from the
separate pools, and give them mountpoints that put them into the same
directory.  You would have to balance the data allocation between the pools
manually, though.


I know that works but I was talking about having files stored at 
different (hardware) locations and yet being in the same ... folder, I 
guess you are using MacOS :)




Perhaps a similar functionality can be established through an abstraction

layer behind network shares.

In Windows this functionality is called 'disk pooling', btw.


In ZFS, disk pooling is done by "creating a zpool", emphasis on singular.
  Do you actually expect a large portion of your disks to go offline
suddenly?  I don't see a good way to handle this (good meaning there are no
missing files under the expected error conditions) that gets you more than
50% of your raw storage capacity (mirrors across the boundary of what you
expect to go down together).  I doubt I would like the outcome of having
some software make arbitrary decisions of what real filesystem to put each
file on, and then having one filesystem fail, so if you really expect this,
you may be happier keeping the two pools separate and deciding where to put
stuff yourself (s

Re: [OpenIndiana-discuss] ZFS; what the manuals don't say ...

2012-10-25 Thread Jim Klimov

2012-10-24 15:17, Robin Axelsson wrote:

On 2012-10-23 20:06, Jim Klimov wrote:

2012-10-23 19:53, Robin Axelsson wrote:

...

But if I do send/receive to the same pool I will need to have enough
free space in it to fit at least two copies of the dataset I want to
reallocate.


Likewise with "reallocation" of files - though the unit of required
space would be smaller.



It seems like what zfs is missing here is a good defrag tool.


This was discussed several times, with the outcome being that with
ZFS's data allocation policies, there is no one good defrag policy.
The two most popular options are about storing the current "live"
copy of a file contiguously (as opposed to its history of released
blocks only referenced in snapshots) vs. storing pool blocks in
ascending creation-TXG order (to arguably speed up scrubs and
resilvers, which can consume a noticeable portion of performance
doing random IO).

Usual users mostly think that the first goal is good - however,
if you add clones and dedup into the equation, it might never be
possible to retain their benefits AND store all files contiguously.

Also, as with other matters of moving blocks around in the allocation
areas and transparently to other layers of the system (that is, on a
live system that actively does I/O while you defrag data), there are
some other problems that I'm not very qualified to speculate about,
that are deemed to be solvable by the generic BPR.

Still, I do think that many of the problems postponed until the time
that BPR arrives, can be solved with different methods and limitations
(such as off-line mangling of data on the pool) which might still be
acceptable to some use-cases.

All-in-all, the main intended usage of ZFS is on relatively powerful
enterprise-class machines, where much of the needed data is cached
on SSD or in huge RAM, so random HDD IO lags become less relevant.
This situation is most noticeable with deduplication, which in ZFS
implementation requires vast resources to basically function.
With market prices going down over time, it is more likely to see
home-NAS boxes tomorrow similarly spec'ed to enterprise servers of
today, than to see the core software fundamentally revised and
rearchitected for boxes of yesterday. After all, even in open-source
world, developers need to eat and feed their families, so commercial
applicability does matter and does influence the engineering designs
and trade-offs.
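
For anyone wondering how heavy their own dedup table already is, zdb can
print DDT statistics without touching the pool (pool name "tank" is just an
example); a commonly quoted rule of thumb is a few hundred bytes of RAM per
unique block:

# -D prints a dedup-table summary and histogram, -DD adds per-table detail
zdb -DD tank
# multiply the number of unique entries by roughly 300-400 bytes to
# estimate how much memory the DDT wants in order to stay fast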




It would be interesting to know how you convert a raidz2 stripe to say a
raidz3 stripe. Let's say that I'm on a raidz2 pool and want to add an
extra parity drive by converting it to a raidz3 pool.  I'm imagining
that would be like creating a raidz1 pool on top of the leaf vdevs that
constitutes the raidz2 pool and the new leaf vdev which results in an
additional parity drive. It doesn't sound too difficult to do that.
Actually, this way you could even get raidz4 or raidz5 pools. Question
is though, how things would pan out performance wise, I would imagine
that a 55 drive raidz25 pool is really taxing on the CPU.

Going from raidz3 to raidz2 or from raidz2 to raidz1 sounds like a
no-brainer; you just remove one drive from the pool and force zpool to
accept the new state as "normal".

But expanding a raidz pool with additional storage while preserving the
parity structure sounds a little bit trickier. I don't think I have that
knowledge to write a bpr rewriter although I'm reading Solaris Internals
right now ;)


Read also the ZFS On-Disk Specification (the one I saw is somewhat
outdated, being from 2006, but most concepts and data structures
are the foundation - expected to remain in place and be expanded upon).

In short, if I got that all right, the leaf components of a top-level
VDEV are striped upon creation and declared as an allocation area with
its ID and monotonic offsets of sectors (and further subdivided into
a couple hundred metaslabs to reduce seeking). For example, on a 5-disk
array the offsets of the pooled sectors might look like this:

  0  1  2  3  4
  5  6  7  8  9
...

(For the purposes of offset numbering, sector size is 512b - even on
4Kb-sectored disks; I am not sure how that's processed in the address
math - likely the ashift value helps pick the specific disk's number).

Then when a piece of data is saved by the OS (kernel metadata or
userland userdata), this is logically combined into a block, processed
for storage (compressed, etc.) and depending on redundancy level some
sectors with parity data are prepended to each set of data sectors.
For a raidz2 of 5 disks you'd have 2 parity sectors and up to 3 data
sectors per row; in the example below P/D are the parity and data of one
block, p/d of a second block, and b/k of a third:

  P0 P1 D0 D1 D2
  P2 P3 D3 D4 p0
  p1 d0 b0 b1 k0
  k1 k2 ...

ZFS allocates only as many sectors as are needed to store the
redundancy and data for the block, so the data (and holes after
removal of data) are not very predictably intermixed - as would
be the case with traditional full-stripe RAID5/6. Still, this
does allow recovery from the loss 

Re: [OpenIndiana-discuss] ZFS; what the manuals don't say ...

2012-10-25 Thread Jim Klimov

2012-10-24 23:58, Timothy Coalson wrote:
>  I doubt I would like the outcome of having
> some software make arbitrary decisions of what real filesystem each
> to put file on, and then having one filesystem fail, so if you really
> expect this, you may be happier keeping the two pools separate and
> deciding where to put stuff yourself (since if you are expecting a
> set of disks to fail, I expect you would have some idea as to which
> ones it would be, for instance an external enclosure).

This to an extent sounds similar (doable with) hierarchical storage
management, such as Sun's SAMFS/QFS solution. Essentially, this is
a (virtual) filesystem where you set up storage rules based on last
access times and frequencies, data types, etc. and where you have
many tiers of storage (ranging from fast, small, expensive to slow,
bulky, cheap), such as SSD arrays - 15K SAS arrays - 7.2K SATA - tape.

New incoming data ends up on the fast tier. Old stale data lives on
tapes. Data used sometimes migrates between tiers. The rules you
define for the HSM system regulate how many copies on which tier
you'd store, so loss of some devices should not be fatal - as well
as cleaning up space on the faster tier to receive new data or to
cache the old data requested by users and fetched from slower tiers.


I did propose to add some HSM-type capabilities to ZFS, mostly with
the goals of power-saving on home-NAS machines, so that the box could
live with a couple of active disks (i.e. rpool and the "active-data"
part of the data pool) while most of the data pool's disks can remain
spun-down. Whenever a user reads some data from the pool (watching
a movie or listening to music or processing his photos) the system
would prefetch the data (perhaps a folder with MP3's) onto the cache
disks and let the big ones spin down - with a home NAS and few users
it is likely that if you're watching a movie, your system is otherwise
unused for a couple of hours.

Likewise, and this happens to be the trickier part, new writes to the
data pool should go to the active disks and occasionally sync to and
spread over the main pool disk.

I hoped this can be all done transparently to users within ZFS, but
overall discussions led to the conclusion that this can better be done
not within ZFS, but with some daemons (perhaps a dtrace-abusing script)
doing the data migration and abstraction (the transparency to users).
Besides, with introduction and advances in generic L2ARC, and with
the possibility of file-level prefetch, much of that discussion became
moot ;)

Hope this small historical insight helps you :)
//Jim Klimov




Re: [OpenIndiana-discuss] ZFS; what the manuals don't say ...

2012-10-24 Thread Timothy Coalson
On Wed, Oct 24, 2012 at 6:17 AM, Robin Axelsson <
gu99r...@student.chalmers.se> wrote:
>
> It would be interesting to know how you convert a raidz2 stripe to say a
> raidz3 stripe. Let's say that I'm on a raidz2 pool and want to add an extra
> parity drive by converting it to a raidz3 pool.  I'm imagining that would
> be like creating a raidz1 pool on top of the leaf vdevs that constitutes
> the raidz2 pool and the new leaf vdev which results in an additional parity
> drive. It doesn't sound too difficult to do that. Actually, this way you
> could even get raidz4 or raidz5 pools. Question is though, how things would
> pan out performance wise, I would imagine that a 55 drive raidz25 pool is
> really taxing on the CPU.
>

Multiple parity is more complicated than that, an additional xor device (a
la traditional raid4) would end up with zeros everywhere, and couldn't
reconstruct your data from an additional failure.  Look at "computing
parity" in http://en.wikipedia.org/wiki/Raid_6#RAID_6 .  While in theory it
can extend to more than 3 parity blocks, it is unclear whether more than 3
will offer any serious additional benefits (using multiple raidz2 vdevs can
give you better IOPS than larger raidz3 vdevs, with little change in raw
space efficiency).  There are also combinatorial implications to multiple
bit errors in a single data chunk with high parity levels, but that is
somewhat unlikely.

Going from raidz3 to raidz2 or from raidz2 to raidz1 sounds like a
> no-brainer; you just remove one drive from the pool and force zpool to
> accept the new state as "normal".
>

A degraded raidz2 vdev has to compute the missing block from parity on
nearly every read; that is not the normal state of raidz1.  Changing the
parity level, either up or down, has similar complications in the on-disk
structure.

But expanding a raidz pool with additional storage while preserving the
> parity structure sounds a little bit trickier. I don't think I have that
> knowledge to write a bpr rewriter although I'm reading Solaris Internals
> right now ;)


Unless raidz* did something radically different than raid5/6 (as in, not
having the parity blocks necessarily next to each other in the data chunk,
and having their positions recorded in the data chunk itself), the position
of the parity and data blocks would change.  The "always consistent on
disk" approach of ZFS adds additional problems to this, which probably make
it impossible to rewrite the re-parity'ed chunk over the old chunk, meaning
it has to find some free space every time it wants to update a chunk to the
new parity level.


>> What you describe here is known as unionfs in Linux, among others.
>> I think there were RFEs or otherwise expressed desires to make that
>> in Solaris and later illumos (I did campaign for that sometime ago),
>> but AFAIK this was not yet done by anyone.
>>
>>  YES, UnionFS-like functionality is what I was talking about. It seems
> like it has been abandoned in favor of AuFS in the Linux and the BSD world.
> It seems to have functions that are a little overkill to use with zfs, such
> as copy-on-write. Perhaps a more simplistic implementation of it would be
> more suitable for zfs.
>

You could create zfs filesystems for subfolders in your "dataset" from the
separate pools, and give them mountpoints that put them into the same
directory.  You would have to balance the data allocation between the pools
manually, though.
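
For example (pool, dataset and path names made up), something along these
lines puts filesystems from two different pools under one common directory,
with each subfolder living entirely in its own pool:

zfs create -o mountpoint=/export/media/movies pool1/movies
zfs create -o mountpoint=/export/media/music  pool2/music

Losing one pool then only takes out its own subfolder, but nothing spreads
files between the pools for you.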

Perhaps a similar functionality can be established through an abstraction
> layer behind network shares.
>
> In Windows this functionality is called 'disk pooling', btw.


In ZFS, disk pooling is done by "creating a zpool", emphasis on singular.
 Do you actually expect a large portion of your disks to go offline
suddenly?  I don't see a good way to handle this (good meaning there are no
missing files under the expected error conditions) that gets you more than
50% of your raw storage capacity (mirrors across the boundary of what you
expect to go down together).  I doubt I would like the outcome of having
some software make arbitrary decisions of what real filesystem to put each
file on, and then having one filesystem fail, so if you really expect this,
you may be happier keeping the two pools separate and deciding where to put
stuff yourself (since if you are expecting a set of disks to fail, I expect
you would have some idea as to which ones it would be, for instance an
external enclosure).
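
A sketch of that 50%-efficiency layout, assuming one enclosure sits on
controller c1 and the other on c2 (device names invented): each mirror pairs
one disk from each enclosure, so a whole enclosure can drop without taking
the pool down.

zpool create tank \
  mirror c1t0d0 c2t0d0 \
  mirror c1t1d0 c2t1d0 \
  mirror c1t2d0 c2t2d0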

If, on the other hand, you don't expect your hardware to drop an entire set
of disks for no good reason, making them into one large storage pool and
putting your filesystem in it will share your data transparently across all
disks without needing to set anything else up.

Tim


Re: [OpenIndiana-discuss] ZFS; what the manuals don't say ...

2012-10-24 Thread Robin Axelsson

On 2012-10-23 20:06, Jim Klimov wrote:

2012-10-23 19:53, Robin Axelsson wrote:

That sounds like a good point, unless you first scan for hard links and
avoid touching the files and their hard links in the shell script, I 
guess.


I guess the idea about reading into memory and writing back into the 
same file (or "cat $SRC > /var/tmp/$SRC && cat /var/tmp/$SRC > $SRC"

to be on the safer side) should take care of hardlinks, since the
inode would stay the same. You should of course ensure that nobody
uses the file in question (i.e. databases are down, etc). You can
also keep track of "rebalanced" inode numbers to avoid processing
hardlinked files more than once.

ZFS send/recv should also take care of these things, and with
sufficient space in the pool to ensure "even" writes (i.e. just
after expansion with new VDEVs) it can be done within the pool if
you don't have a spare one. Then you can ensure all needed "local"
dataset properties are transferred, remove the old dataset and
rename the new copy to its name (likewise for hierarchies of
datasets).

But if I do send/receive to the same pool I will need to have enough 
free space in it to fit at least two copies of the dataset I want to 
reallocate.


But I heard that a pool that is almost full has some performance 
issues, especially when you try to delete files from that pool. But
maybe this becomes a non-issue once the pool is expanded by another 
vdev.


This issue may remain - basically, when a pool is nearly full (YMMV,
empirically over 80-90% for pools with many write-delete cycles,
but there were reports of even 60% full being a problem), its block
allocation may look like good cheese with many tiny holes. Walking
the free space to find a hole big enough to write a new block takes
time, hence the slowdown. When you expand the pool with a new vdev,
the old full cheesy one does not go away, and writes that the ZFS pipeline
intended to put there would still lag (and may now time out and
may get to another vdev, as someone else mentioned in this thread).



It seems like what zfs is missing here is a good defrag tool.



To answer your other letters,

> But if I have two raidz3 vdevs, is there any way to create an
> isolation/separation between them so that if one of them fails, only the
> data that is stored within that vdev will be lost and all data that
> happen to be stored in the other can be recovered? And yet let them both
> be accessible from the same path?
>
> The only thing that needs to be sorted out is where the files should go
> when you write to that path and avoid splitting such that one half of
> the file goes to one vdev and another goes to the other vdev. Maybe
> there is some disk or i/o scheduler that can handle such operations?

You can't do that. A pool is one whole (you can't also remove vdevs
from it and you can't change or reduce raidzN groups' redundancy -
may be that will change after the long-awaited BPR = block-pointer
rewriter is implemented by some kind samaritan), and as soon as it
is set up or expanded all writes go striped to all components and
all top-level components are required not-failed to import the pool
and use it.

It would be interesting to know how you convert a raidz2 stripe to say a 
raidz3 stripe. Let's say that I'm on a raidz2 pool and want to add an 
extra parity drive by converting it to a raidz3 pool.  I'm imagining 
that would be like creating a raidz1 pool on top of the leaf vdevs that 
constitutes the raidz2 pool and the new leaf vdev which results in an 
additional parity drive. It doesn't sound too difficult to do that. 
Actually, this way you could even get raidz4 or raidz5 pools. Question 
is though, how things would pan out performance wise, I would imagine 
that a 55 drive raidz25 pool is really taxing on the CPU.


Going from raidz3 to raidz2 or from raidz2 to raidz1 sounds like a 
no-brainer; you just remove one drive from the pool and force zpool to 
accept the new state as "normal".


But expanding a raidz pool with additional storage while preserving the 
parity structure sounds a little bit trickier. I don't think I have that 
knowledge to write a bpr rewriter although I'm reading Solaris Internals 
right now ;)



> I can't see how a dataset can span over several zpools as you usually
> create it with mypool/datasetname (in the case of a file system
> dataset). But I can see several datasets in one pool though (e.g.
> mypool/dataset1, mypool/dataset2 ...). So the relationship I see is pool
> *onto* dataset.

It can't. A dataset is contained in one pool. Many datasets can
be contained in one pool and share the free space, dedup table and
maybe some other resources. Datasets contained in different pools
are unrelated.

> But if I have two separate pools with separate names, say mypool1 and
> mypool2 I could create a zfs file system dataset with the same name in
> each of these pools and then give these two datasets the same
> "mountpoint" property couldn't I? Then they would be forced 

Re: [OpenIndiana-discuss] ZFS; what the manuals don't say ...

2012-10-23 Thread Jim Klimov

2012-10-23 19:53, Robin Axelsson wrote:

That sounds like a good point, unless you first scan for hard links and
avoid touching the files and their hard links in the shell script, I guess.


I guess the idea about reading into memory and writing back into the 
same file (or "cat $SRC > /var/tmp/$SRC && cat /var/tmp/$SRC > $SRC"

to be on the safer side) should take care of hardlinks, since the
inode would stay the same. You should of course ensure that nobody
uses the file in question (i.e. databases are down, etc). You can
also keep track of "rebalanced" inode numbers to avoid processing
hardlinked files more than once.
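
A rough sketch of that idea (paths invented; run it only on quiescent data
and mind your snapshots, since every block gets rewritten):

#!/bin/sh
# rewrite each file in place via a temp copy: the inode stays the same,
# so hardlinks stay intact, but the blocks get freshly allocated
TMP=/var/tmp/rebalance.$$
SEEN=/var/tmp/rebalance-inodes.$$
: > "$SEEN"
find /tank/data -type f | while read -r f; do
  ino=`ls -i "$f" | awk '{print $1}'`
  grep -qx "$ino" "$SEEN" && continue      # hardlink already rewritten
  cat "$f" > "$TMP" && cat "$TMP" > "$f"   # same inode, new blocks
  echo "$ino" >> "$SEEN"
done
rm -f "$TMP" "$SEEN"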

ZFS send/recv should also take care of these things, and with
sufficient space in the pool to ensure "even" writes (i.e. just
after expansion with new VDEVs) it can be done within the pool if
you don't have a spare one. Then you can ensure all needed "local"
dataset properties are transferred, remove the old dataset and
rename the new copy to its name (likewise for hierarchies of
datasets).
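
In commands that is roughly (dataset names invented; -R carries snapshots and
local properties, -u keeps the copy from mounting on top of the original
before the rename):

zfs snapshot -r tank/data@move
zfs send -R tank/data@move | zfs receive -u tank/data.new
# verify the copy, then swap it into place
zfs destroy -r tank/data
zfs rename tank/data.new tank/data
zfs mount -a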



But I heard that a pool that is almost full has some performance
issues, especially when you try to delete files from that pool. But
maybe this becomes a non-issue once the pool is expanded by another vdev.


This issue may remain - basically, when a pool is nearly full (YMMV,
empirically over 80-90% for pools with many write-delete cycles,
but there were reports of even 60% full being a problem), its block
allocation may look like good cheese with many tiny holes. Walking
the free space to find a hole big enough to write a new block takes
time, hence the slowdown. When you expand the pool with a new vdev,
the old full cheesy one does not go away, and writes that the ZFS pipeline
intended to put there would still lag (and may now time out and
may get to another vdev, as someone else mentioned in this thread).
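
To keep an eye on how close a pool is to that territory (pool name is an
example):

zpool list tank          # the CAP column shows how full the pool is
zpool get capacity tank  # or query just that one property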


To answer your other letters,

> But if I have two raidz3 vdevs, is there any way to create an
> isolation/separation between them so that if one of them fails, only the
> data that is stored within that vdev will be lost and all data that
> happen to be stored in the other can be recovered? And yet let them both
> be accessible from the same path?
>
> The only thing that needs to be sorted out is where the files should go
> when you write to that path and avoid splitting such that one half of
> the file goes to one vdev and another goes to the other vdev. Maybe
> there is some disk or i/o scheduler that can handle such operations?

You can't do that. A pool is one whole (you can't also remove vdevs
from it and you can't change or reduce raidzN groups' redundancy -
maybe that will change after the long-awaited BPR = block-pointer
rewriter is implemented by some kind Samaritan), and as soon as it
is set up or expanded all writes go striped to all components and
all top-level components are required not-failed to import the pool
and use it.

> I can't see how a dataset can span over several zpools as you usually
> create it with mypool/datasetname (in the case of a file system
> dataset). But I can see several datasets in one pool though (e.g.
> mypool/dataset1, mypool/dataset2 ...). So the relationship I see is pool
> *onto* dataset.

It can't. A dataset is contained in one pool. Many datasets can
be contained in one pool and share the free space, dedup table and
maybe some other resources. Datasets contained in different pools
are unrelated.

> But if I have two separate pools with separate names, say mypool1 and
> mypool2 I could create a zfs file system dataset with the same name in
> each of these pools and then give these two datasets the same
> "mountpoint" property couldn't I? Then they would be forced to be
> mounted to the same path.

One at a time - yes. Both at once (in a useful manner) - no.
If the mountpoint is not empty, zfs refuses to mount the dataset.
Even if you force it to (using an overlay mount, -O), the last mounted
dataset's filesystem will be all you'd see.
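
A sketch of what that looks like in practice (names invented); the second
mount is refused while the directory is non-empty, and forcing an overlay
simply hides the first filesystem:

zfs set mountpoint=/export/data pool1/data
zfs set mountpoint=/export/data pool2/data
zfs mount pool2/data      # refused: /export/data is not empty
zfs mount -O pool2/data   # overlay "works", but only pool2/data is visible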

You can however mount other datasets into logical "subdirectories"
of the dataset you need to "expand", but those subs must be empty
or nonexistent in your currently existing "parent" dataset. Also
the new "children" are separate filesystems, so it is your quest
to move data into them if you need to free up the existing dataset,
and in particular remember that inodes of different filesystems
are unrelated, so hardlinks will break for those files that would
be forced to split from one inode in the source filesystem to
several inodes (i.e. some pathnames in the source FS and some in
the child) - like for any other FS boundary crossings.


* Can several datasets be mounted to the same mount point, i.e. can multiple "file 
system"-datasets be mounted so that they (the root of them) are all accessed from 
exactly the same (POSIX) path and subdirectories with coinciding names will be merged? 
The purpose of this would be to seamlessly expand storage capacity this way just like 
when adding vdevs to a pool.


What you describe 

Re: [OpenIndiana-discuss] ZFS; what the manuals don't say ...

2012-10-23 Thread Roy Sigurd Karlsbakk
> And hardlinks ?

For hardlinks this is indeed bad, so depending on whether you use them,
this may or may not be a good idea.

> This is a perfect way to completely trash your
> system. There's no need to 'balance' zfs, over time filesystem
> writes will balance roughly over the vdevs, only files never
> touched again will stay where they are. So don't risk your
> system just to get a few bytes/sec more out of it.

With current versions of ZFS, writes are balanced ok over free space, unlike 
earlier. Still, if you want to maximise performance, you want your data to be 
levelled out on available drives. If we ever get BPR, it will solve this and
many other problems…

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 98013356
r...@karlsbakk.net
http://blogg.karlsbakk.net/
GPG Public key: http://karlsbakk.net/roysigurdkarlsbakk.pubkey.txt
--
In all pedagogy it is essential that the curriculum be presented intelligibly. It is 
an elementary imperative for all pedagogues to avoid excessive use of 
idioms with xenotypic etymology. In most cases, adequate and relevant 
synonyms exist in Norwegian.



Re: [OpenIndiana-discuss] ZFS; what the manuals don't say ...

2012-10-23 Thread Robin Axelsson

On 2012-10-23 17:32, Udo Grabowski (IMK) wrote:

On 23/10/2012 17:18, Roy Sigurd Karlsbakk wrote:

Wouldn't walking the filesystem, making a copy, deleting the original
and renaming the copy balance things?

e.g.

#!/bin/sh

LIST=`find /foo -type d`

for I in ${LIST}
do

cp ${I} ${I}.tmp
rm ${I}
mv ${I}.tmp ${I}

done


or perhaps

> 
And hardlinks ? This is a perfect way to completely trash your
system. There's no need to 'balance' zfs, over time filesystem
writes will balance roughly over the vdevs, only files never
touched again will stay where they are. So don't risk your
system just to get a few bytes/sec more out of it.

That sounds like a good point, unless you first scan for hard links and 
avoid touching the files and their hard links in the shell script, I guess.


But I heard that a pool that is almost full has some performance 
issues, especially when you try to delete files from that pool. But 
maybe this becomes a non-issue once the pool is expanded by another vdev.






Re: [OpenIndiana-discuss] ZFS; what the manuals don't say ...

2012-10-23 Thread Robin Axelsson

On 2012-10-23 16:22, George Wilson wrote:

Comments inline...

On 10/23/12 8:29 AM, Robin Axelsson wrote:

Hi,
I've been using zfs for a while but still there are some questions 
that have remained unanswered even after reading the documentation so 
I thought I would ask them here.


I have learned that zfs datasets can be expanded by adding vdevs. Say 
that you have created say a raidz3 pool named "mypool" with the command

# zpool create mypool raidz3 disk1 disk2 disk3 ... disk8

you can expand the capacity by adding vdevs to it through the command

# zpool add mypool raidz3 disk9 disk10 ... disk16

The vdev that is added doesn't need to have the same raid/mirror 
configuration or disk geometry, if I understand correctly. It will 
merely be dynamically concatenated with the old storage pool. The 
documentation says that it will be "striped" but it is not so clear 
what that means if data is already stored in the old vdevs of the pool.


Unanswered questions:

* What determines _where_ the data will be stored on a such a pool? 
Will it fill up the old vdev(s) before moving on to the new one or 
will the data be distributed evenly?


The data is written in a round-robin fashion across all the top-level 
vdevs (i.e. the raidz vdevs). So it will get distributed across them 
as you fill up the pool. It does not fill up one vdev before proceeding.


* If the old pool is almost full, an even distribution will be 
impossible, unless zpool rearranges/relocates data upon adding the 
vdev. Is that what will happen upon adding a vdev?


As you write new data it will try to even out the vdevs. In many cases 
this is not possible and you may end up with the majority of the 
writes going to the empty vdevs. There is logic in zfs to avoid 
certain vdevs if we're unable to allocate from them during a given 
transaction group commit. So when vdevs are very full you may find 
that very little data is being written to them.
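
You can watch that per-vdev behaviour directly (pool name as in the example
above); the per-vdev capacity and write counters show where new allocations
are landing:

zpool iostat -v mypool 5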


* Can the individual vdevs be read independently/separately? If say 
the newly added vdev faults, will the entire pool be unreadable or 
will I still be able to access the old data? What if I took a 
snapshot before adding the new vdev?


If you lose a top-level vdev then you probably won't be able to access 
your old data. If you're lucky you might be able to retrieve some data 
that was not contained on that top-level vdev but given that ZFS 
stripes across all vdevs it means that most of your data could be 
lost. Losing a leaf vdev (i.e. a single disk) within a top-level vdev 
is a different story. If you lose a leaf vdev then raidz will allow 
you to continue to use the disk and pool in a degraded state. You can 
then spare out the failed leaf vdev or replace the disk.
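
For the leaf-vdev case, the usual recovery steps look roughly like this
(device names invented):

zpool status mypool                  # pool DEGRADED, bad disk FAULTED/UNAVAIL
zpool replace mypool c4t3d0 c4t9d0   # resilver onto a replacement disk
zpool add mypool spare c4t9d0        # or keep a hot spare attached ahead of time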


* Can several datasets be mounted to the same mount point, i.e. can 
multiple "file system"-datasets be mounted so that they (the root of 
them) are all accessed from exactly the same (POSIX) path and 
subdirectories with coinciding names will be merged? The purpose of 
this would be to seamlessly expand storage capacity this way just 
like when adding vdevs to a pool.
I think you might be confused about datasets and how they are 
expanded. Datasets see all the space within a pool. There is not a 
one-to-one mapping of dataset to pool. So if you want to create 10 
datasets and you find that you're running out of space then you simply 
add another top-level vdev to your pool and all the datasets see the 
additional space. I'm pretty certain that doesn't answer your question 
but maybe it helps in other ways. Feel free to ask again.


But if I have two raidz3 vdevs, is there any way to create an 
isolation/separation between them so that if one of them fails, only the 
data that is stored within that vdev will be lost and all data that 
happen to be stored in the other can be recovered? And yet let them both 
be accessible from the same path?


The only thing that needs to be sorted out is where the files should go 
when you write to that path and avoid splitting such that one half of 
the file goes to one vdev and another goes to the other vdev. Maybe 
there is some disk or i/o scheduler that can handle such operations?



I can't see how a dataset can span over several zpools as you usually 
create it with mypool/datasetname (in the case of a file system 
dataset). But I can see several datasets in one pool though (e.g. 
mypool/dataset1, mypool/dataset2 ...). So the relationship I see is pool 
*onto* dataset.


But if I have two separate pools with separate names, say mypool1 and 
mypool2 I could create a zfs file system dataset with the same name in 
each of these pools and then give these two datasets the same 
"mountpoint" property couldn't I? Then they would be forced to be 
mounted to the same path.


I feel now that the other questions are straightened out.
* If that's the case how will the data be distributed/allocated over 
the datasets if I copy a data file to that path?


Data from all datasets are striped across the top-l

Re: [OpenIndiana-discuss] ZFS; what the manuals don't say ...

2012-10-23 Thread Udo Grabowski (IMK)

On 23/10/2012 17:18, Roy Sigurd Karlsbakk wrote:

Wouldn't walking the filesystem, making a copy, deleting the original
and renaming the copy balance things?

e.g.

#!/bin/sh

LIST=`find /foo -type d`

for I in ${LIST}
do

cp ${I} ${I}.tmp
rm ${I}
mv ${I}.tmp ${I}

done


or perhaps

> 
And hardlinks ? This is a perfect way to completely trash your
system. There's no need to 'balance' zfs, over time filesystem
writes will balance roughly over the vdevs, only files never
touched again will stay where they are. So don't risk your
system just to get a few bytes/sec more out of it.
--
Dr.Udo GrabowskiInst.f.Meteorology a.Climate Research IMK-ASF-SAT
www-imk.fzk.de/asf/sat/grabowski/ www.imk-asf.kit.edu/english/sat.php
KIT - Karlsruhe Institute of Technologyhttp://www.kit.edu
Postfach 3640,76021 Karlsruhe,Germany  T:(+49)721 608-26026 F:-926026



Re: [OpenIndiana-discuss] ZFS; what the manuals don't say ...

2012-10-23 Thread Gregory S. Youngblood
Probably should use find -type f to limit to files and also cp -a to maintain 
permissions and ownership. Not sure if that will maintain ACLs.

For the truly paranoid, don't delete the original file so early; rename it, move 
the temp file back as the original filename, then compare md5 or sha checksums 
to make sure they are the same, only deleting the original file if the two sums 
match.

--
Sent from my Jelly Bean Galaxy Nexus

Roy Sigurd Karlsbakk  wrote:

>> Wouldn't walking the filesystem, making a copy, deleting the original
>> and renaming the copy balance things?
>> 
>> e.g.
>> 
>> #!/bin/sh
>> 
>> LIST=`find /foo -type d`
>> 
>> for I in ${LIST}
>> do
>> 
>> cp ${I} ${I}.tmp
>> rm ${I}
>> mv ${I}.tmp ${I}
>> 
>> done
>
>or perhaps
>
># === rewrite.sh ===
>#!/bin/bash
>
>fn="$1"
>newfn="$fn.tmp"
>
>cp "$fn" "$newfn"
>rm -f "$fn"
>mv "$newfn" "$fn"
># === rewrite.sh ===
>
>find /foo -type f -exec /path/to/rewrite.sh {} \;
>
>Vennlige hilsener / Best regards
>
>roy
>--
>Roy Sigurd Karlsbakk
>(+47) 98013356
>r...@karlsbakk.net
>http://blogg.karlsbakk.net/
>GPG Public key: http://karlsbakk.net/roysigurdkarlsbakk.pubkey.txt
>--
>In all pedagogy it is essential that the curriculum be presented intelligibly. It is 
>an elementary imperative for all pedagogues to avoid excessive use of 
>idioms with xenotypic etymology. In most cases, adequate and relevant 
>synonyms exist in Norwegian.
>


Re: [OpenIndiana-discuss] ZFS; what the manuals don't say ...

2012-10-23 Thread Doug Hughes

On 10/23/2012 11:08 AM, Robin Axelsson wrote:

On 2012-10-23 15:41, Doug Hughes wrote:

On 10/23/2012 8:29 AM, Robin Axelsson wrote:

Hi,
I've been using zfs for a while but still there are some questions 
that have remained unanswered even after reading the documentation 
so I thought I would ask them here.


I have learned that zfs datasets can be expanded by adding vdevs. 
Say that you have created say a raidz3 pool named "mypool" with the 
command

# zpool create mypool raidz3 disk1 disk2 disk3 ... disk8

you can expand the capacity by adding vdevs to it through the command

# zpool add mypool raidz3 disk9 disk10 ... disk16

The vdev that is added doesn't need to have the same raid/mirror 
configuration or disk geometry, if I understand correctly. It will 
merely be dynamically concatenated with the old storage pool. The 
documentation says that it will be "striped" but it is not so clear 
what that means if data is already stored in the old vdevs of the pool.


Unanswered questions:

* What determines _where_ the data will be stored on a such a pool? 
Will it fill up the old vdev(s) before moving on to the new one or 
will the data be distributed evenly?
* If the old pool is almost full, an even distribution will be 
impossible, unless zpool rearranges/relocates data upon adding the 
vdev. Is that what will happen upon adding a vdev?
* Can the individual vdevs be read independently/separately? If say 
the newly added vdev faults, will the entire pool be unreadable or 
will I still be able to access the old data? What if I took a 
snapshot before adding the new vdev?


* Can several datasets be mounted to the same mount point, i.e. can 
multiple "file system"-datasets be mounted so that they (the root of 
them) are all accessed from exactly the same (POSIX) path and 
subdirectories with coinciding names will be merged? The purpose of 
this would be to seamlessly expand storage capacity this way just 
like when adding vdevs to a pool.
* If that's the case how will the data be distributed/allocated over 
the datasets if I copy a data file to that path?


Kind regards
Robin.


*) yes, you can dynamically add more disks and zfs will just start 
using them.

*) zfs stripes across all vdevs as evenly as it can.
*) as your old vdev gets full, zfs will only allocate blocks to the 
newer, less full vdev.
*) since it's a stripe across vdevs (and they should all be raidz2 or 
better!), if one vdev fails, your filesystem will be unavailable. They 
are not independent unless you put them in a separate pool.
*) you cannot have overlapping/mixed filesystems at exactly the same 
place; however, it is perfectly possible to have e.g. /export be on 
rootpool, /export/mystuff on zpool1 and /export/mystuff/morestuff be 
on zpool2.


The unasked question is "If I wanted the vdevs to be equally 
balanced, could I?". The answer is a qualified yes. What you would 
need to do is reopen every single file, buffer it to memory, then 
write every block out again. We did this operation once. It means 
that all vdevs will roughly have the same block allocation when you 
are done.



Do you happen to know how that's done in OI? Otherwise I would have to 
move each file one by one to a disk location outside the dataset and 
then move it back or zfs send the dataset to another pool of at least 
equal size and then zfs receive it back to the expanded pool.


you don't have to move it; you just have to open it, read it into memory, 
seek back to the beginning, and write it out again. Rewriting those 
blocks will take care of it, since ZFS is copy-on-write. You will need to 
be wary of your snapshots during this process: every file gets rewritten, 
so the snapshots keep the old blocks and you can roughly double your 
space consumption.


(basically a perl, python, or other similar script could do this)
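
For example, a rough shell sketch of that approach might look like the 
following (the path /tank/data is just a placeholder for the dataset's 
mountpoint; cp -p keeps mode, ownership and timestamps, though hard 
links are not preserved):

# rewrite each file via a temporary copy so its blocks get reallocated;
# the final mv atomically puts the new copy in place of the original
find /tank/data -type f -exec sh -c \
    'cp -p "$1" "$1.tmp" && mv "$1.tmp" "$1"' sh {} \;

# and, per the snapshot caveat above, check which snapshots will keep
# holding on to the old blocks
zfs list -t snapshot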


___
OpenIndiana-discuss mailing list
OpenIndiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss


Re: [OpenIndiana-discuss] ZFS; what the manuals don't say ...

2012-10-23 Thread Roy Sigurd Karlsbakk
> Wouldn't walking the filesystem, making a copy, deleting the original
> and renaming the copy balance things?
> 
> e.g.
> 
> #!/bin/sh
> 
> LIST=`find /foo -type f`
> 
> for I in ${LIST}
> do
> 
> cp -p "${I}" "${I}.tmp"
> rm "${I}"
> mv "${I}.tmp" "${I}"
> 
> done

or perhaps

# === rewrite.sh ===
#!/bin/bash

fn="$1"
newfn="$fn.tmp"

cp -p "$fn" "$newfn"
rm -f "$fn"
mv "$newfn" "$fn"
# === rewrite.sh ===

find /foo -type f -exec /path/to/rewrite.sh {} \;

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 98013356
r...@karlsbakk.net
http://blogg.karlsbakk.net/
GPG Public key: http://karlsbakk.net/roysigurdkarlsbakk.pubkey.txt
--
In all pedagogy it is essential that the curriculum be presented intelligibly. It is 
an elementary imperative for all pedagogues to avoid excessive use of 
idioms of xenotypic etymology. In most cases, adequate and relevant 
synonyms exist in Norwegian.

___
OpenIndiana-discuss mailing list
OpenIndiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss


Re: [OpenIndiana-discuss] ZFS; what the manuals don't say ...

2012-10-23 Thread Roy Sigurd Karlsbakk
> Do you happen to know how that's done in OI? Otherwise I would have to
> move each file one by one to a disk location outside the dataset and
> then move it back or zfs send the dataset to another pool of at least
> equal size and then zfs receive it back to the expanded pool.

Unless something was added recently, this isn't something ZFS can do itself. 
You can, however, do a "find /dataset -type f -exec rewrite.sh {} \;" or 
something similar.

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 98013356
r...@karlsbakk.net
http://blogg.karlsbakk.net/
GPG Public key: http://karlsbakk.net/roysigurdkarlsbakk.pubkey.txt
--
In all pedagogy it is essential that the curriculum be presented intelligibly. It is 
an elementary imperative for all pedagogues to avoid excessive use of 
idioms of xenotypic etymology. In most cases, adequate and relevant 
synonyms exist in Norwegian.

___
OpenIndiana-discuss mailing list
OpenIndiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss


Re: [OpenIndiana-discuss] ZFS; what the manuals don't say ...

2012-10-23 Thread Reginald Beardsley


--- On Tue, 10/23/12, Doug Hughes  wrote:

[snip]

> The unasked question is "If I wanted the vdevs to be equally
> balanced, could I?". The answer is a qualified yes. What
> you would need to do is reopen every single file, buffer it
> to memory, then write every block out again. We did this
> operation once. It means that all vdevs will roughly have
> the same block allocation when you are done.

Wouldn't walking the filesystem, making a copy, deleting the original and 
renaming the copy balance things?

e.g.

#!/bin/sh

LIST=`find /foo -type f`

for I in ${LIST}
   do

   cp -p "${I}" "${I}.tmp"
   rm "${I}"
   mv "${I}.tmp" "${I}"

done


___
OpenIndiana-discuss mailing list
OpenIndiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss


Re: [OpenIndiana-discuss] ZFS; what the manuals don't say ...

2012-10-23 Thread Robin Axelsson

On 2012-10-23 15:41, Doug Hughes wrote:

On 10/23/2012 8:29 AM, Robin Axelsson wrote:

Hi,
I've been using zfs for a while, but there are still some questions 
that have remained unanswered even after reading the documentation, 
so I thought I would ask them here.


I have learned that zfs datasets can be expanded by adding vdevs. Say 
that you have created a raidz3 pool named "mypool" with the command

# zpool create mypool raidz3 disk1 disk2 disk3 ... disk8

you can expand the capacity by adding vdevs to it through the command

# zpool add mypool raidz3 disk9 disk10 ... disk16

The vdev that is added doesn't need to have the same raid/mirror 
configuration or disk geometry, if I understand correctly. It will 
merely be dynamically concatenated with the old storage pool. The 
documentation says that it will be "striped" but it is not so clear 
what that means if data is already stored in the old vdevs of the pool.


Unanswered questions:

* What determines _where_ the data will be stored on such a pool? 
Will it fill up the old vdev(s) before moving on to the new one or 
will the data be distributed evenly?
* If the old pool is almost full, an even distribution will be 
impossible, unless zpool rearranges/relocates data upon adding the 
vdev. Is that what will happen upon adding a vdev?
* Can the individual vdevs be read independently/separately? If say 
the newly added vdev faults, will the entire pool be unreadable or 
will I still be able to access the old data? What if I took a 
snapshot before adding the new vdev?


* Can several datasets be mounted to the same mount point, i.e. can 
multiple "file system"-datasets be mounted so that they (the root of 
them) are all accessed from exactly the same (POSIX) path and 
subdirectories with coinciding names will be merged? The purpose of 
this would be to seamlessly expand storage capacity this way just 
like when adding vdevs to a pool.
* If that's the case how will the data be distributed/allocated over 
the datasets if I copy a data file to that path?


Kind regards
Robin.


*) yes, you can dynamically add more disks and zfs will just start 
using them.

*) zfs stripes across all vdevs as evenly as it can.
*) as your old vdev gets full, zfs will only allocate blocks to the 
newer, less full vdev.
*) since it's a stripe across vdevs (and they should all be raidz2 or 
better!), if one vdev fails, your filesystem will be unavailable. They 
are not independent unless you put them in a separate pool.
*) you cannot have overlapping/mixed filesystems at exactly the same 
place; however, it is perfectly possible to have e.g. /export be on 
rootpool, /export/mystuff on zpool1 and /export/mystuff/morestuff be 
on zpool2.


The unasked question is "If I wanted the vdevs to be equally balanced, 
could I?". The answer is a qualified yes. What you would need to do 
is reopen every single file, buffer it to memory, then write every 
block out again. We did this operation once. It means that all vdevs 
will roughly have the same block allocation when you are done.



Do you happen to know how that's done in OI? Otherwise I would have to 
move each file one by one to a disk location outside the dataset and 
then move it back or zfs send the dataset to another pool of at least 
equal size and then zfs receive it back to the expanded pool.


___
OpenIndiana-discuss mailing list
OpenIndiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss


Re: [OpenIndiana-discuss] ZFS; what the manuals don't say ...

2012-10-23 Thread George Wilson

Comments inline...

On 10/23/12 8:29 AM, Robin Axelsson wrote:

Hi,
I've been using zfs for a while, but there are still some questions 
that have remained unanswered even after reading the documentation, 
so I thought I would ask them here.


I have learned that zfs datasets can be expanded by adding vdevs. Say 
that you have created a raidz3 pool named "mypool" with the command

# zpool create mypool raidz3 disk1 disk2 disk3 ... disk8

you can expand the capacity by adding vdevs to it through the command

# zpool add mypool raidz3 disk9 disk10 ... disk16

The vdev that is added doesn't need to have the same raid/mirror 
configuration or disk geometry, if I understand correctly. It will 
merely be dynamically concatenated with the old storage pool. The 
documentation says that it will be "striped" but it is not so clear 
what that means if data is already stored in the old vdevs of the pool.


Unanswered questions:

* What determines _where_ the data will be stored on such a pool? 
Will it fill up the old vdev(s) before moving on to the new one or 
will the data be distributed evenly?


The data is written in a round-robin fashion across all the top-level 
vdevs (i.e. the raidz vdevs). So it will get distributed across them as 
you fill up the pool. It does not fill up one vdev before proceeding.
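
One way to watch this in practice (assuming the pool from the example 
above is named mypool) is the per-vdev capacity report, which lists 
each top-level raidz vdev separately:

# zpool iostat -v mypool

The alloc and free columns show how new writes are being spread across 
the top-level vdevs as the pool fills.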


* If the old pool is almost full, an even distribution will be 
impossible, unless zpool rearranges/relocates data upon adding the 
vdev. Is that what will happen upon adding a vdev?


As you write new data it will try to even out the vdevs. In many cases 
this is not possible and you may end up with the majority of the writes 
going to the empty vdevs. There is logic in zfs to avoid certain vdevs 
if we're unable to allocate from them during a given transaction group 
commit. So when vdevs are very full you may find that very little data 
is being written to them.


* Can the individual vdevs be read independently/separately? If say 
the newly added vdev faults, will the entire pool be unreadable or 
will I still be able to access the old data? What if I took a snapshot 
before adding the new vdev?


If you lose a top-level vdev then you probably won't be able to access 
your old data. If you're lucky you might be able to retrieve some data 
that was not contained on that top-level vdev but given that ZFS stripes 
across all vdevs it means that most of your data could be lost. Losing a 
leaf vdev (i.e. a single disk) within a top-level vdev is a different 
story. If you lose a leaf vdev, raidz will allow you to continue to 
use the vdev and the pool in a degraded state. You can then spare out 
the failed leaf vdev or replace the disk.
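
For example, handling a failed leaf vdev might look roughly like this 
(the disk names are only illustrative):

# zpool status -x
# zpool replace mypool c4t3d0 c4t9d0

zpool status reports the degraded raidz vdev, and zpool replace 
resilvers the new disk in place of the failed one.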


* Can several datasets be mounted to the same mount point, i.e. can 
multiple "file system"-datasets be mounted so that they (the root of 
them) are all accessed from exactly the same (POSIX) path and 
subdirectories with coinciding names will be merged? The purpose of 
this would be to seamlessly expand storage capacity this way just like 
when adding vdevs to a pool.
I think you might be confused about datasets and how they are expanded. 
Datasets see all the space within a pool. There is not a one-to-one 
mapping of dataset to pool. So if you want to create 10 datasets and you 
find that you're running out of space then you simply add another 
top-level vdev to your pool and all the datasets see the additional 
space. I'm pretty certain that doesn't answer your question, but maybe 
it helps in other ways. Feel free to ask again.
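
As a small illustration (the dataset names are made up), every dataset 
in the pool reports the same pool-wide available space, and that figure 
grows for all of them as soon as a new top-level vdev is added:

# zfs create mypool/projects
# zfs create mypool/media
# zfs list -o name,used,avail -r mypool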


* If that's the case how will the data be distributed/allocated over 
the datasets if I copy a data file to that path?


Data from all datasets are striped across the top-level vdevs. The 
notion of a given dataset only writing to a single raidz device in the 
pool does not exist.


Thanks,
George



Kind regards
Robin.



___
OpenIndiana-discuss mailing list
OpenIndiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss


Re: [OpenIndiana-discuss] ZFS; what the manuals don't say ...

2012-10-23 Thread Doug Hughes

On 10/23/2012 8:29 AM, Robin Axelsson wrote:

Hi,
I've been using zfs for a while, but there are still some questions 
that have remained unanswered even after reading the documentation, 
so I thought I would ask them here.


I have learned that zfs datasets can be expanded by adding vdevs. Say 
that you have created a raidz3 pool named "mypool" with the command

# zpool create mypool raidz3 disk1 disk2 disk3 ... disk8

you can expand the capacity by adding vdevs to it through the command

# zpool add mypool raidz3 disk9 disk10 ... disk16

The vdev that is added doesn't need to have the same raid/mirror 
configuration or disk geometry, if I understand correctly. It will 
merely be dynamically concatenated with the old storage pool. The 
documentation says that it will be "striped" but it is not so clear 
what that means if data is already stored in the old vdevs of the pool.


Unanswered questions:

* What determines _where_ the data will be stored on such a pool? 
Will it fill up the old vdev(s) before moving on to the new one or 
will the data be distributed evenly?
* If the old pool is almost full, an even distribution will be 
impossible, unless zpool rearranges/relocates data upon adding the 
vdev. Is that what will happen upon adding a vdev?
* Can the individual vdevs be read independently/separately? If say 
the newly added vdev faults, will the entire pool be unreadable or 
will I still be able to access the old data? What if I took a snapshot 
before adding the new vdev?


* Can several datasets be mounted to the same mount point, i.e. can 
multiple "file system"-datasets be mounted so that they (the root of 
them) are all accessed from exactly the same (POSIX) path and 
subdirectories with coinciding names will be merged? The purpose of 
this would be to seamlessly expand storage capacity this way just like 
when adding vdevs to a pool.
* If that's the case how will the data be distributed/allocated over 
the datasets if I copy a data file to that path?


Kind regards
Robin.


*) yes, you can dynamically add more disks and zfs will just start using 
them.

*) zfs stripes across all vdevs as evenly as it can.
*) as your old vdev gets full, zfs will only allocate blocks to the 
newer, less full vdev.
*) since it's a stripe across vdevs (and they should all be raidz2 or 
better!), if one vdev fails, your filesystem will be unavailable. They 
are not independent unless you put them in a separate pool.
*) you cannot have overlapping/mixed filesystems at exactly the same 
place; however, it is perfectly possible to have e.g. /export be on 
rootpool, /export/mystuff on zpool1 and /export/mystuff/morestuff be on 
zpool2 (see the sketch below).
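
A sketch of that layout (following the pool names above; the dataset 
names are illustrative) could be created with explicit mountpoints:

# zfs create -o mountpoint=/export/mystuff zpool1/mystuff
# zfs create -o mountpoint=/export/mystuff/morestuff zpool2/morestuff

Each dataset still lives entirely in its own pool; only the mountpoints 
nest.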


The unasked question is "If I wanted the vdevs to be equally balanced, 
could I?". The answer is a qualified yes. What you would need to do is 
reopen every single file, buffer it to memory, then write every block 
out again. We did this operation once. It means that all vdevs will 
roughly have the same block allocation when you are done.




___
OpenIndiana-discuss mailing list
OpenIndiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss