Re: [OpenIndiana-discuss] ZFS; what the manuals don't say ...
On Oct 28, 2012, at 5:10 AM, Robin Axelsson wrote:

> On 2012-10-24 21:58, Timothy Coalson wrote:
>> On Wed, Oct 24, 2012 at 6:17 AM, Robin Axelsson <gu99r...@student.chalmers.se> wrote:
>>> It would be interesting to know how you convert a raidz2 stripe to, say, a raidz3 stripe. Let's say that I'm on a raidz2 pool and want to add an extra parity drive by converting it to a raidz3 pool. I'm imagining that would be like creating a raidz1 pool on top of the leaf vdevs that constitute the raidz2 pool and the new leaf vdev, which results in an additional parity drive. It doesn't sound too difficult to do that. Actually, this way you could even get raidz4 or raidz5 pools. Question is, though, how things would pan out performance-wise; I would imagine that a 55-drive raidz25 pool is really taxing on the CPU.
>>
>> Multiple parity is more complicated than that; an additional xor device (a la traditional raid4) would end up with zeros everywhere, and couldn't reconstruct your data from an additional failure. Look at "computing parity" in http://en.wikipedia.org/wiki/Raid_6#RAID_6 . While in theory it can extend to more than 3 parity blocks, it is unclear whether more than 3 will offer any serious additional benefits (using multiple raidz2 vdevs can give you better IOPS than larger raidz3 vdevs, with little change in raw space efficiency). There are also combinatorial implications to multiple bit errors in a single data chunk with high parity levels, but that is somewhat unlikely.
>
> XOR you say? I didn't know that raidz used xor for parity. I thought they used some kind of a Reed-Solomon implementation à la PAR2 on the block level to achieve "RAID like" functionality. It was never stated, from what I could read in the documentation, that the raid functionality was implemented like traditional hardware RAID.
> If xor is the case then I'm curious as to how they managed to pull off a raidz3 implementation with three-disk redundancy.

The first parity is XOR (which is also a Reed-Solomon syndrome). The 2nd and 3rd parity are other syndromes. Also, minor nit: there is no such thing as hardware RAID, there is only software RAID.

 -- richard

> Maybe a good read into the zpool source code would help clarify things...
>
>>> Going from raidz3 to raidz2 or from raidz2 to raidz1 sounds like a no-brainer; you just remove one drive from the pool and force zpool to accept the new state as "normal".
>>
>> A degraded raidz2 vdev has to compute the missing block from parity on nearly every read; this is not the normal state of raidz1. Changing the parity level, either up or down, has similar complications in the on-disk structure.
>>
>>> But expanding a raidz pool with additional storage while preserving the parity structure sounds a little bit trickier. I don't think I have the knowledge to write a bpr rewriter although I'm reading Solaris Internals right now ;)
>>
>> Unless raidz* did something radically different than raid5/6 (as in, not having the parity blocks necessarily next to each other in the data chunk, and having their positions recorded in the data chunk itself), the position of the parity and data blocks would change. The "always consistent on disk" approach of ZFS adds additional problems to this, which probably make it impossible to rewrite the re-parity'ed chunk over the old chunk, meaning it has to find some free space every time it wants to update a chunk to the new parity level.
>
>> What you describe here is known as unionfs in Linux, among others. I think there were RFEs or otherwise expressed desires to make that in Solaris and later illumos (I did campaign for that some time ago), but AFAIK this was not yet done by anyone.
>
> YES, UnionFS-like functionality is what I was talking about.
> It seems like it has been abandoned in favor of AuFS in the Linux and the BSD world. It seems to have functions that are a little overkill to use with zfs, such as copy-on-write. Perhaps a more simplistic implementation of it would be more suitable for zfs.
>
>> You could create zfs filesystems for subfolders in your "dataset" from the separate pools, and give them mountpoints that put them into the same directory. You would have to balance the data allocation between the pools manually, though.
>
> I know that works but I was talking about having files stored at different (hardware) locations and yet being in the same ... folder. I guess you are using MacOS :)
>
>>> Perhaps a similar functionality can be established through an abstraction layer behind network shares.
>>>
>>> In Windows this functionality is called 'disk pooling', btw.
>>
>> In ZFS, disk pooling is done by "creating a zpool", emphasis on singular. Do you actually expect a large portion of your disks to go offline suddenly? I don't see a g
Re: [OpenIndiana-discuss] ZFS; what the manuals don't say ...
On Sun, Oct 28, 2012 at 6:42 AM, Robin Axelsson wrote:
> As I understand, fragmentation occurs when you remove files and rewrite files to a storage pool a large enough number of times. So if I have a fragmented storage pool and copy all files to a new empty storage pool, the new storage pool would not be fragmented. So I take it what you are saying is that the way the files are organized in the new storage pool is sometimes *worse* than the way they are organized in the old fragmented storage pool?

Robin, I think you missed this part by Jim:

On Thu, Oct 25, 2012 at 4:21 AM, Jim Klimov wrote:
> This was discussed several times, with the outcome being that with ZFS's data allocation policies, there is no one good defrag policy. The two most popular options are about storing the current "live" copy of a file contiguously (as opposed to its history of released blocks only referenced in snapshots) vs. storing pool blocks in ascending creation-TXG order (to arguably speed up scrubs and resilvers, which can consume a noticeable portion of performance doing random IO).
>
> Usual users mostly think that the first goal is good - however, if you add clones and dedup into the equation, it might never be possible to retain their benefits AND store all files contiguously.

There are two important points:

1) if 'dedup=on' or you have clones, there is no way to defragment the files unless you turn off dedup, and make two entirely separate filesystems for your clones...

2) as soon as your newly-created filesystem has a snapshot and you change some data, there are at least two 'right' answers for what should be 'contiguous': the original snapshot, or the current 'live' filesystem, with good arguments on both sides. You seem to feel the latter, while for my usage I feel the former (see... even we can't agree :-)).

Jan
___
OpenIndiana-discuss mailing list
OpenIndiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss
Re: [OpenIndiana-discuss] ZFS; what the manuals don't say ...
On 2012-10-25 12:21, Jim Klimov wrote:
> 2012-10-24 15:17, Robin Axelsson wrote:
>> On 2012-10-23 20:06, Jim Klimov wrote:
>>> 2012-10-23 19:53, Robin Axelsson wrote:
>>> ...
>> But if I do send/receive to the same pool I will need to have enough free space in it to fit at least two copies of the dataset I want to reallocate.
>
> Likewise with "reallocation" of files - though the unit of required space would be smaller.
>
>> It seems like what zfs is missing here is a good defrag tool.
>
> This was discussed several times, with the outcome being that with ZFS's data allocation policies, there is no one good defrag policy. The two most popular options are about storing the current "live" copy of a file contiguously (as opposed to its history of released blocks only referenced in snapshots) vs. storing pool blocks in ascending creation-TXG order (to arguably speed up scrubs and resilvers, which can consume a noticeable portion of performance doing random IO).
>
> Usual users mostly think that the first goal is good - however, if you add clones and dedup into the equation, it might never be possible to retain their benefits AND store all files contiguously.
>
> Also, as with other matters of moving blocks around in the allocation areas and transparently to other layers of the system (that is, on a live system that actively does I/O while you defrag data), there are some other problems that I'm not very qualified to speculate about, that are deemed to be solvable by the generic BPR. Still, I do think that many of the problems postponed until the time that BPR arrives can be solved with different methods and limitations (such as off-line mangling of data on the pool) which might still be acceptable to some use-cases.
>
> All-in-all, the main intended usage of ZFS is on relatively powerful enterprise-class machines, where much of the needed data is cached on SSD or in huge RAM, so random HDD IO lags become less relevant.
> This situation is most noticeable with deduplication, which in the ZFS implementation requires vast resources to basically function. With market prices going down over time, it is more likely to see home-NAS boxes tomorrow similarly spec'ed to enterprise servers of today, than to see the core software fundamentally revised and rearchitected for boxes of yesterday. After all, even in the open-source world, developers need to eat and feed their families, so commercial applicability does matter and does influence the engineering designs and trade-offs.

Actually I have refrained from using dedup and compression as I somehow feel that the risks and penalties that come with them outweigh the benefits, at least for my purposes.

It should also be noted that a lot of defrag software out there gives you the option to choose different policies according to which you can reorganize the data on a disk. A tool that continuously monitors disk activity could give information as to what policy would be optimal for a particular disk configuration.

As I understand, fragmentation occurs when you remove files and rewrite files to a storage pool a large enough number of times. So if I have a fragmented storage pool and copy all files to a new empty storage pool, the new storage pool would not be fragmented. So I take it what you are saying is that the way the files are organized in the new storage pool is sometimes *worse* than the way they are organized in the old fragmented storage pool?

It would be interesting to know how you convert a raidz2 stripe to, say, a raidz3 stripe. Let's say that I'm on a raidz2 pool and want to add an extra parity drive by converting it to a raidz3 pool. I'm imagining that would be like creating a raidz1 pool on top of the leaf vdevs that constitute the raidz2 pool and the new leaf vdev, which results in an additional parity drive. It doesn't sound too difficult to do that. Actually, this way you could even get raidz4 or raidz5 pools.
Question is, though, how things would pan out performance-wise; I would imagine that a 55-drive raidz25 pool is really taxing on the CPU.

Going from raidz3 to raidz2 or from raidz2 to raidz1 sounds like a no-brainer; you just remove one drive from the pool and force zpool to accept the new state as "normal".

But expanding a raidz pool with additional storage while preserving the parity structure sounds a little bit trickier. I don't think I have the knowledge to write a bpr rewriter although I'm reading Solaris Internals right now ;)

> Read also the ZFS On-Disk Specification (the one I saw is somewhat outdated, being from 2006, but most concepts and data structures are the foundation - expected to remain in place and be expanded upon).
>
> In short, if I got that all right, the leaf components of a top-level VDEV are striped upon creation and declared as an allocation area with its ID and monotonous offsets of sectors (and further subdivided into a couple hundred SPAs to reduce seeking). For example, on a 5-disk array the offsets of the pooled sectors might look
Re: [OpenIndiana-discuss] ZFS; what the manuals don't say ...
Yes, this sounds like an interesting solution but it doesn't seem that SAMFS or SAM-QFS is implemented in OI; I could be wrong. The documentation for that functionality doesn't seem to be accessible anymore. Access to docs.sun.com is redirected to a standard page on the Oracle website.

On 2012-10-25 10:51, Jim Klimov wrote:
> 2012-10-24 23:58, Timothy Coalson wrote:
>> I doubt I would like the outcome of having some software make arbitrary decisions of what real filesystem to put each file on, and then having one filesystem fail, so if you really expect this, you may be happier keeping the two pools separate and deciding where to put stuff yourself (since if you are expecting a set of disks to fail, I expect you would have some idea as to which ones it would be, for instance an external enclosure).
>
> This to an extent sounds similar to (doable with) hierarchical storage management, such as Sun's SAMFS/QFS solution. Essentially, this is a (virtual) filesystem where you set up storage rules based on last access times and frequencies, data types, etc. and where you have many tiers of storage (ranging from fast, small, expensive to slow, bulky, cheap), such as SSD arrays - 15K SAS arrays - 7.2K SATA - tape. New incoming data ends up on the fast tier. Old stale data lives on tapes. Data used sometimes migrates between tiers.
>
> The rules you define for the HSM system regulate how many copies on which tier you'd store, so loss of some devices should not be fatal - as well as cleaning up space on the faster tier to receive new data or to cache the old data requested by users and fetched from slower tiers.
>
> I did propose to add some HSM-type capabilities to ZFS, mostly with the goals of power-saving on home-NAS machines, so that the box could live with a couple of active disks (i.e. rpool and the "active-data" part of the data pool) while most of the data pool's disks can remain spun-down.
> Whenever a user reads some data from the pool (watching a movie or listening to music or processing his photos) the system would prefetch the data (perhaps a folder with MP3s) onto the cache disks and let the big ones spin down - with a home NAS and few users it is likely that if you're watching a movie, your system is otherwise unused for a couple of hours.
>
> Likewise, and this happens to be the trickier part, new writes to the data pool should go to the active disks and occasionally sync to and spread over the main pool disks. I hoped this could all be done transparently to users within ZFS, but overall discussions led to the conclusion that this can better be done not within ZFS, but with some daemons (perhaps a dtrace-abusing script) doing the data migration and abstraction (the transparency to users).
>
> Besides, with the introduction and advances of the generic L2ARC, and with the possibility of file-level prefetch, much of that discussion became moot ;)
>
> Hope this small historical insight helps you :)
>
> //Jim Klimov
Re: [OpenIndiana-discuss] ZFS; what the manuals don't say ...
On 2012-10-24 21:58, Timothy Coalson wrote:
> On Wed, Oct 24, 2012 at 6:17 AM, Robin Axelsson <gu99r...@student.chalmers.se> wrote:
>> It would be interesting to know how you convert a raidz2 stripe to, say, a raidz3 stripe. Let's say that I'm on a raidz2 pool and want to add an extra parity drive by converting it to a raidz3 pool. I'm imagining that would be like creating a raidz1 pool on top of the leaf vdevs that constitute the raidz2 pool and the new leaf vdev, which results in an additional parity drive. It doesn't sound too difficult to do that. Actually, this way you could even get raidz4 or raidz5 pools. Question is, though, how things would pan out performance-wise; I would imagine that a 55-drive raidz25 pool is really taxing on the CPU.
>
> Multiple parity is more complicated than that; an additional xor device (a la traditional raid4) would end up with zeros everywhere, and couldn't reconstruct your data from an additional failure. Look at "computing parity" in http://en.wikipedia.org/wiki/Raid_6#RAID_6 . While in theory it can extend to more than 3 parity blocks, it is unclear whether more than 3 will offer any serious additional benefits (using multiple raidz2 vdevs can give you better IOPS than larger raidz3 vdevs, with little change in raw space efficiency). There are also combinatorial implications to multiple bit errors in a single data chunk with high parity levels, but that is somewhat unlikely.

XOR you say? I didn't know that raidz used xor for parity. I thought they used some kind of a Reed-Solomon implementation à la PAR2 on the block level to achieve "RAID like" functionality. It was never stated, from what I could read in the documentation, that the raid functionality was implemented like traditional hardware RAID. If xor is the case then I'm curious as to how they managed to pull off a raidz3 implementation with three-disk redundancy.

Maybe a good read into the zpool source code would help clarify things...
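[Editor's note: the "XOR plus further syndromes" scheme discussed in this thread can be illustrated with a toy byte-level computation. This is a sketch of the standard RAID-6 dual-parity math, not code from the raidz source; the names (gf_mul, parities, recover_two) are invented here, and real raidz applies the same GF(2^8) arithmetic to whole sectors rather than single bytes.]

```python
PRIM = 0x11d  # a primitive polynomial commonly used for GF(2^8) RAID-6 math

def gf_mul(a, b):
    """Multiply two GF(2^8) elements (carry-less multiply, reduced mod PRIM)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= PRIM
        b >>= 1
    return r

def gf_pow(a, n):
    r = 1
    for _ in range(n):
        r = gf_mul(r, a)
    return r

def gf_inv(a):
    return next(b for b in range(1, 256) if gf_mul(a, b) == 1)

def parities(data):
    """P is the plain XOR; Q weights disk i by 2^i in GF(2^8), so it is an
    independent syndrome rather than a second copy of P."""
    p = q = 0
    for i, d in enumerate(data):
        p ^= d
        q ^= gf_mul(gf_pow(2, i), d)
    return p, q

def recover_two(data, p, q, x, y):
    """Rebuild the bytes lost on data disks x < y from the survivors plus P and Q."""
    pxor, qxor = p, q
    for i, d in enumerate(data):
        if i not in (x, y):
            pxor ^= d                       # leaves d[x] ^ d[y]
            qxor ^= gf_mul(gf_pow(2, i), d) # leaves 2^x*d[x] ^ 2^y*d[y]
    gx, gy = gf_pow(2, x), gf_pow(2, y)
    dy = gf_mul(qxor ^ gf_mul(gx, pxor), gf_inv(gx ^ gy))
    return pxor ^ dy, dy

data = [0x12, 0x34, 0x56, 0x78]   # one byte per data disk
p, q = parities(data)
assert recover_two(data, p, q, 1, 3) == (0x34, 0x78)  # two erasures recovered
```

A second XOR device would make the two equations linearly dependent; the 2^i weights are what let the pair be solved for two unknowns, and a third syndrome with different weights extends this to three erasures (raidz3).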
>> Going from raidz3 to raidz2 or from raidz2 to raidz1 sounds like a no-brainer; you just remove one drive from the pool and force zpool to accept the new state as "normal".
>
> A degraded raidz2 vdev has to compute the missing block from parity on nearly every read; this is not the normal state of raidz1. Changing the parity level, either up or down, has similar complications in the on-disk structure.
>
>> But expanding a raidz pool with additional storage while preserving the parity structure sounds a little bit trickier. I don't think I have the knowledge to write a bpr rewriter although I'm reading Solaris Internals right now ;)
>
> Unless raidz* did something radically different than raid5/6 (as in, not having the parity blocks necessarily next to each other in the data chunk, and having their positions recorded in the data chunk itself), the position of the parity and data blocks would change. The "always consistent on disk" approach of ZFS adds additional problems to this, which probably make it impossible to rewrite the re-parity'ed chunk over the old chunk, meaning it has to find some free space every time it wants to update a chunk to the new parity level.

>> What you describe here is known as unionfs in Linux, among others. I think there were RFEs or otherwise expressed desires to make that in Solaris and later illumos (I did campaign for that some time ago), but AFAIK this was not yet done by anyone.

YES, UnionFS-like functionality is what I was talking about. It seems like it has been abandoned in favor of AuFS in the Linux and the BSD world. It seems to have functions that are a little overkill to use with zfs, such as copy-on-write. Perhaps a more simplistic implementation of it would be more suitable for zfs.

> You could create zfs filesystems for subfolders in your "dataset" from the separate pools, and give them mountpoints that put them into the same directory. You would have to balance the data allocation between the pools manually, though.
I know that works but I was talking about having files stored at different (hardware) locations and yet being in the same ... folder. I guess you are using MacOS :)

>> Perhaps a similar functionality can be established through an abstraction layer behind network shares.
>>
>> In Windows this functionality is called 'disk pooling', btw.
>
> In ZFS, disk pooling is done by "creating a zpool", emphasis on singular. Do you actually expect a large portion of your disks to go offline suddenly? I don't see a good way to handle this (good meaning there are no missing files under the expected error conditions) that gets you more than 50% of your raw storage capacity (mirrors across the boundary of what you expect to go down together). I doubt I would like the outcome of having some software make arbitrary decisions of what real filesystem to put each file on, and then having one filesystem fail, so if you really expect this, you may be happier keeping the two pools separate and deciding where to put stuff yourself (s
Re: [OpenIndiana-discuss] ZFS; what the manuals don't say ...
2012-10-24 15:17, Robin Axelsson wrote:
> On 2012-10-23 20:06, Jim Klimov wrote:
>> 2012-10-23 19:53, Robin Axelsson wrote:
>> ...
> But if I do send/receive to the same pool I will need to have enough free space in it to fit at least two copies of the dataset I want to reallocate.

Likewise with "reallocation" of files - though the unit of required space would be smaller.

> It seems like what zfs is missing here is a good defrag tool.

This was discussed several times, with the outcome being that with ZFS's data allocation policies, there is no one good defrag policy. The two most popular options are about storing the current "live" copy of a file contiguously (as opposed to its history of released blocks only referenced in snapshots) vs. storing pool blocks in ascending creation-TXG order (to arguably speed up scrubs and resilvers, which can consume a noticeable portion of performance doing random IO).

Usual users mostly think that the first goal is good - however, if you add clones and dedup into the equation, it might never be possible to retain their benefits AND store all files contiguously.

Also, as with other matters of moving blocks around in the allocation areas and transparently to other layers of the system (that is, on a live system that actively does I/O while you defrag data), there are some other problems that I'm not very qualified to speculate about, that are deemed to be solvable by the generic BPR. Still, I do think that many of the problems postponed until the time that BPR arrives can be solved with different methods and limitations (such as off-line mangling of data on the pool) which might still be acceptable to some use-cases.

All-in-all, the main intended usage of ZFS is on relatively powerful enterprise-class machines, where much of the needed data is cached on SSD or in huge RAM, so random HDD IO lags become less relevant. This situation is most noticeable with deduplication, which in the ZFS implementation requires vast resources to basically function.
With market prices going down over time, it is more likely to see home-NAS boxes tomorrow similarly spec'ed to enterprise servers of today, than to see the core software fundamentally revised and rearchitected for boxes of yesterday. After all, even in the open-source world, developers need to eat and feed their families, so commercial applicability does matter and does influence the engineering designs and trade-offs.

> It would be interesting to know how you convert a raidz2 stripe to, say, a raidz3 stripe. Let's say that I'm on a raidz2 pool and want to add an extra parity drive by converting it to a raidz3 pool. I'm imagining that would be like creating a raidz1 pool on top of the leaf vdevs that constitute the raidz2 pool and the new leaf vdev, which results in an additional parity drive. It doesn't sound too difficult to do that. Actually, this way you could even get raidz4 or raidz5 pools. Question is, though, how things would pan out performance-wise; I would imagine that a 55-drive raidz25 pool is really taxing on the CPU.
>
> Going from raidz3 to raidz2 or from raidz2 to raidz1 sounds like a no-brainer; you just remove one drive from the pool and force zpool to accept the new state as "normal".
>
> But expanding a raidz pool with additional storage while preserving the parity structure sounds a little bit trickier. I don't think I have the knowledge to write a bpr rewriter although I'm reading Solaris Internals right now ;)

Read also the ZFS On-Disk Specification (the one I saw is somewhat outdated, being from 2006, but most concepts and data structures are the foundation - expected to remain in place and be expanded upon).

In short, if I got that all right, the leaf components of a top-level VDEV are striped upon creation and declared as an allocation area with its ID and monotonous offsets of sectors (and further subdivided into a couple hundred SPAs to reduce seeking).
For example, on a 5-disk array the offsets of the pooled sectors might look like this:

  disk:   1   2   3   4   5
          0   1   2   3   4
          5   6   7   8   9
          ...

(For the purposes of offset numbering, sector size is 512b - even on 4Kb-sectored disks; I am not sure how that's processed in the address math - likely the ashift value helps pick the specific disk's number.)

Then when a piece of data is saved by the OS (kernel metadata or userland userdata), this is logically combined into a block, processed for storage (compressed, etc.) and depending on redundancy level some sectors with parity data are prepended to each set of data sectors. For a raidz2 of 5 disks you'd have 2 parity sectors (P, p, b) and up to 3 data sectors (D, d, k) per row, as in the example below:

  disk:   1   2   3   4   5
          P0  P1  D0  D1  D2
          P2  P3  D3  D4  p0
          p1  d0  b0  b1  k0
          k1  k2  ...

ZFS allocates only as many sectors as are needed to store the redundancy and data for the block, so the data (and holes after removal of data) are not very predictably intermixed - as would be the case with traditional full-stripe RAID5/6. Still, this does allow recovery from the loss
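[Editor's note: the sector sequence in the example above can be reproduced with a small model. This is only a sketch of the allocation order described here, assuming each row segment of a block gets its parity sectors prepended; real raidz additionally handles ashift padding, skip sectors, and per-metaslab allocation.]

```python
def allocate(ndisks, nparity, blocks):
    """Lay out blocks as a flat list of sector labels in pool-offset order.
    blocks is a list of (parity_label, data_label, n_data_sectors)."""
    seq = []
    stride = ndisks - nparity              # data sectors per full row
    for plab, dlab, ndata in blocks:
        pi = di = 0
        for start in range(0, ndata, stride):
            for _ in range(nparity):       # parity leads each row segment
                seq.append(f"{plab}{pi}"); pi += 1
            for _ in range(min(stride, ndata - start)):
                seq.append(f"{dlab}{di}"); di += 1
    return seq

def to_rows(seq, ndisks):
    """Fold the flat offset sequence onto the disks, round-robin."""
    return [seq[i:i + ndisks] for i in range(0, len(seq), ndisks)]

# The three blocks from the example: 5, 1 and 3 data sectors on a 5-disk raidz2.
seq = allocate(5, 2, [("P", "D", 5), ("p", "d", 1), ("b", "k", 3)])
assert seq == ["P0", "P1", "D0", "D1", "D2", "P2", "P3", "D3", "D4",
               "p0", "p1", "d0", "b0", "b1", "k0", "k1", "k2"]
```

Folding `seq` with `to_rows(seq, 5)` reproduces the layout diagram above, including the small block (p0 p1 d0) that occupies only three sectors instead of a full stripe.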
Re: [OpenIndiana-discuss] ZFS; what the manuals don't say ...
2012-10-24 23:58, Timothy Coalson wrote:
> I doubt I would like the outcome of having some software make arbitrary decisions of what real filesystem to put each file on, and then having one filesystem fail, so if you really expect this, you may be happier keeping the two pools separate and deciding where to put stuff yourself (since if you are expecting a set of disks to fail, I expect you would have some idea as to which ones it would be, for instance an external enclosure).

This to an extent sounds similar to (doable with) hierarchical storage management, such as Sun's SAMFS/QFS solution. Essentially, this is a (virtual) filesystem where you set up storage rules based on last access times and frequencies, data types, etc. and where you have many tiers of storage (ranging from fast, small, expensive to slow, bulky, cheap), such as SSD arrays - 15K SAS arrays - 7.2K SATA - tape. New incoming data ends up on the fast tier. Old stale data lives on tapes. Data used sometimes migrates between tiers.

The rules you define for the HSM system regulate how many copies on which tier you'd store, so loss of some devices should not be fatal - as well as cleaning up space on the faster tier to receive new data or to cache the old data requested by users and fetched from slower tiers.

I did propose to add some HSM-type capabilities to ZFS, mostly with the goals of power-saving on home-NAS machines, so that the box could live with a couple of active disks (i.e. rpool and the "active-data" part of the data pool) while most of the data pool's disks can remain spun-down. Whenever a user reads some data from the pool (watching a movie or listening to music or processing his photos) the system would prefetch the data (perhaps a folder with MP3s) onto the cache disks and let the big ones spin down - with a home NAS and few users it is likely that if you're watching a movie, your system is otherwise unused for a couple of hours.
Likewise, and this happens to be the trickier part, new writes to the data pool should go to the active disks and occasionally sync to and spread over the main pool disks. I hoped this could all be done transparently to users within ZFS, but overall discussions led to the conclusion that this can better be done not within ZFS, but with some daemons (perhaps a dtrace-abusing script) doing the data migration and abstraction (the transparency to users).

Besides, with the introduction and advances of the generic L2ARC, and with the possibility of file-level prefetch, much of that discussion became moot ;)

Hope this small historical insight helps you :)

//Jim Klimov
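[Editor's note: the rule-driven tier placement described above can be caricatured in a few lines. The tier names and thresholds here are invented for illustration; they are not SAM-QFS's actual policy language, and a real HSM also weighs data type, file size, and per-tier copy-count rules.]

```python
def choose_tier(days_since_access, reads_per_month):
    """Toy HSM placement rule: hot data stays on fast media, stale data
    ages out toward tape. All cutoffs are made-up illustrative values."""
    if days_since_access < 7 or reads_per_month > 30:
        return "ssd"          # new or frequently requested data
    if days_since_access < 90:
        return "sas15k"       # warm data on fast spindles
    if days_since_access < 365:
        return "sata"         # cool bulk storage
    return "tape"             # old stale data

assert choose_tier(1, 0) == "ssd"       # freshly written
assert choose_tier(400, 40) == "ssd"    # old but popular: cached back up
assert choose_tier(30, 2) == "sas15k"
assert choose_tier(1000, 0) == "tape"
```

The interesting part of a real HSM is not this lookup but the migration daemon that applies it continuously and fetches data back up the tiers on demand, which is exactly the part the discussion concluded was better done outside ZFS.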
Re: [OpenIndiana-discuss] ZFS; what the manuals don't say ...
On Wed, Oct 24, 2012 at 6:17 AM, Robin Axelsson <gu99r...@student.chalmers.se> wrote:
> It would be interesting to know how you convert a raidz2 stripe to, say, a raidz3 stripe. Let's say that I'm on a raidz2 pool and want to add an extra parity drive by converting it to a raidz3 pool. I'm imagining that would be like creating a raidz1 pool on top of the leaf vdevs that constitute the raidz2 pool and the new leaf vdev, which results in an additional parity drive. It doesn't sound too difficult to do that. Actually, this way you could even get raidz4 or raidz5 pools. Question is, though, how things would pan out performance-wise; I would imagine that a 55-drive raidz25 pool is really taxing on the CPU.

Multiple parity is more complicated than that; an additional xor device (a la traditional raid4) would end up with zeros everywhere, and couldn't reconstruct your data from an additional failure. Look at "computing parity" in http://en.wikipedia.org/wiki/Raid_6#RAID_6 . While in theory it can extend to more than 3 parity blocks, it is unclear whether more than 3 will offer any serious additional benefits (using multiple raidz2 vdevs can give you better IOPS than larger raidz3 vdevs, with little change in raw space efficiency). There are also combinatorial implications to multiple bit errors in a single data chunk with high parity levels, but that is somewhat unlikely.

> Going from raidz3 to raidz2 or from raidz2 to raidz1 sounds like a no-brainer; you just remove one drive from the pool and force zpool to accept the new state as "normal".

A degraded raidz2 vdev has to compute the missing block from parity on nearly every read; this is not the normal state of raidz1. Changing the parity level, either up or down, has similar complications in the on-disk structure.

> But expanding a raidz pool with additional storage while preserving the parity structure sounds a little bit trickier.
> I don't think I have the knowledge to write a bpr rewriter although I'm reading Solaris Internals right now ;)

Unless raidz* did something radically different than raid5/6 (as in, not having the parity blocks necessarily next to each other in the data chunk, and having their positions recorded in the data chunk itself), the position of the parity and data blocks would change. The "always consistent on disk" approach of ZFS adds additional problems to this, which probably make it impossible to rewrite the re-parity'ed chunk over the old chunk, meaning it has to find some free space every time it wants to update a chunk to the new parity level.

>> What you describe here is known as unionfs in Linux, among others. I think there were RFEs or otherwise expressed desires to make that in Solaris and later illumos (I did campaign for that some time ago), but AFAIK this was not yet done by anyone.
>
> YES, UnionFS-like functionality is what I was talking about. It seems like it has been abandoned in favor of AuFS in the Linux and the BSD world. It seems to have functions that are a little overkill to use with zfs, such as copy-on-write. Perhaps a more simplistic implementation of it would be more suitable for zfs.

You could create zfs filesystems for subfolders in your "dataset" from the separate pools, and give them mountpoints that put them into the same directory. You would have to balance the data allocation between the pools manually, though.

> Perhaps a similar functionality can be established through an abstraction layer behind network shares.
>
> In Windows this functionality is called 'disk pooling', btw.

In ZFS, disk pooling is done by "creating a zpool", emphasis on singular. Do you actually expect a large portion of your disks to go offline suddenly?
I don't see a good way to handle this (good meaning there are no missing files under the expected error conditions) that gets you more than 50% of your raw storage capacity (mirrors across the boundary of what you expect to go down together). I doubt I would like the outcome of having some software make arbitrary decisions of what real filesystem to put each file on, and then having one filesystem fail, so if you really expect this, you may be happier keeping the two pools separate and deciding where to put stuff yourself (since if you are expecting a set of disks to fail, I expect you would have some idea as to which ones it would be, for instance an external enclosure).

If, on the other hand, you don't expect your hardware to drop an entire set of disks for no good reason, making them into one large storage pool and putting your filesystem in it will share your data transparently across all disks without needing to set anything else up.

Tim
Re: [OpenIndiana-discuss] ZFS; what the manuals don't say ...
On 2012-10-23 20:06, Jim Klimov wrote: 2012-10-23 19:53, Robin Axelsson wrote: That sounds like a good point, unless you first scan for hard links and avoid touching the files and their hard links in the shell script, I guess. I guess the idea about reading into memory and writing back into the same file (or "cat $SRC > /var/tmp/$SRC && cat /var/tmp/$SRC > $SRC" to be on the safer side) should take care of hardlinks, since the inode would stay the same. You should of course ensure that nobody uses the file in question (i.e. databases are down, etc). You can also keep track of "rebalanced" inode numbers to avoid processing hardlinked files more than once. ZFS send/recv should also take care of these things, and with sufficient space in the pool to ensure "even" writes (i.e. just after expansion with new VDEVs) it can be done within the pool if you don't have a spare one. Then you can ensure all needed "local" dataset properties are transfered, remove the old dataset and rename the new copy to its name (likewise for hierarchies of datasets). But if I do send/receive to the same pool I will need to have enough free space in it to fit at least two copies of the dataset I want to reallocate. But I heard that a pool that is almost full have some performance issues, especially when you try to delete files from that pool. But maybe this becomes a non-issue once the pool is expanded by another vdev. This issue may remain - basically, when a pool is nearly full (YMMV, empirically over 80-90% for pools with many write-delete cycles, but there were reports of even 60% full being a problem), its block allocation may look like good cheese with many tiny holes. Walking the free space to find a hole big enough to write a new block takes time, hence the slowdown. 
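Jim's cat-via-temp-file idea, together with the inode bookkeeping he mentions, could be sketched like this (hypothetical script, not from the thread; assumes paths without whitespace and that nothing else is writing to the files):

```shell
#!/bin/sh
# Sketch of an inode-preserving rewrite: copy the file out, then `cat`
# it back through '>', which truncates the existing file in place, so
# the inode (and therefore all hardlinks) survives. A list of already
# seen inodes skips hardlinked files that were rewritten once.
rebalance() {
    seen=$(mktemp)
    for f in $(find "$1" -type f); do
        ino=$(ls -i "$f" | awk '{print $1}')
        grep -qx "$ino" "$seen" && continue   # hardlink already done
        echo "$ino" >> "$seen"
        tmp=$(mktemp)
        cp "$f" "$tmp" && cat "$tmp" > "$f"   # '>' keeps the same inode
        rm -f "$tmp"
    done
    rm -f "$seen"
}

# Demo on a throwaway directory containing a hardlinked file:
demo=$(mktemp -d)
echo "hello" > "$demo/a"
ln "$demo/a" "$demo/b"
before=$(ls -i "$demo/a" | awk '{print $1}')
rebalance "$demo"
after=$(ls -i "$demo/a" | awk '{print $1}')
echo "inode before=$before after=$after content=$(cat "$demo/b")"
```

On ZFS the rewritten blocks are freshly allocated (copy-on-write), which is what spreads them over the new vdev; the same caveats from the thread apply (snapshots will pin the old blocks and double space usage).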
When you expand the pool with a new vdev, the old full cheesy one does not go away, and writes that ZFS pipe line intended to put there would still lag (and may now time out and may get to another vdev, as someone else mentioned in this thread). It seems like what zfs is missing here is a good defrag tool. To answer your other letters, > But if I have two raidz3 vdevs, is there any way to create an > isolation/separation between them so that if one of them fails, only the > data that is stored within that vdev will be lost and all data that > happen to be stored in the other can be recovered? And yet let them both > be accessible from the same path? > > The only thing that needs to be sorted out is where the files should go > when you write to that path and avoid splitting such that one half if > the file goes to one vdev and another goes to the other vdev. Maybe > there is some disk or i/o scheduler that can handle such operations? You can't do that. A pool is one whole (you can't also remove vdevs from it and you can't change or reduce raidzN groups' redundancy - may be that will change after the long-awaited BPR = block-pointer rewriter is implemented by some kind samaritan), and as soon as it is set up or expanded all writes go striped to all components and all top-level components are required not-failed to import the pool and use it. It would be interesting to know how you convert a raidz2 stripe to say a raidz3 stripe. Let's say that I'm on a raidz2 pool and want to add an extra parity drive by converting it to a raidz3 pool. I'm imagining that would be like creating a raidz1 pool on top of the leaf vdevs that constitutes the raidz2 pool and the new leaf vdev which results in an additional parity drive. It doesn't sound too difficult to do that. Actually, this way you could even get raidz4 or raidz5 pools. Question is though, how things would pan out performance wise, I would imagine that a 55 drive raidz25 pool is really taxing on the CPU. 
Going from raidz3 to raidz2 or from raidz2 to raidz1 sounds like a no-brainer; you just remove one drive from the pool and force zpool to accept the new state as "normal". But expanding a raidz pool with additional storage while preserving the parity structure sounds a little bit trickier. I don't think I have that knowledge to write a bpr rewriter although I'm reading Solaris Internals right now ;) > I can't see how a dataset can span over several zpools as you usually > create it with mypool/datasetname (in the case of a file system > dataset). But I can see several datasets in one pool though (e.g. > mypool/dataset1, mypool/dataset2 ...). So the relationship I see is pool > *onto* dataset. It can't. A dataset is contained in one pool. Many datasets can be contained in one pool and share the free space, dedup table and maybe some other resources. Datasets contained in different pools are unrelated. > But if I have two separate pools with separate names, say mypool1 and > mypool2 I could create a zfs file system dataset with the same name in > each of these pools and then give these two datasets the same > "mountpoint" property couldn't I? Then they would be forced to be mounted to the same path.
Re: [OpenIndiana-discuss] ZFS; what the manuals don't say ...
2012-10-23 19:53, Robin Axelsson wrote: That sounds like a good point, unless you first scan for hard links and avoid touching the files and their hard links in the shell script, I guess. I guess the idea about reading into memory and writing back into the same file (or "cat $SRC > /var/tmp/$SRC && cat /var/tmp/$SRC > $SRC" to be on the safer side) should take care of hardlinks, since the inode would stay the same. You should of course ensure that nobody uses the file in question (i.e. databases are down, etc). You can also keep track of "rebalanced" inode numbers to avoid processing hardlinked files more than once. ZFS send/recv should also take care of these things, and with sufficient space in the pool to ensure "even" writes (i.e. just after expansion with new VDEVs) it can be done within the pool if you don't have a spare one. Then you can ensure all needed "local" dataset properties are transfered, remove the old dataset and rename the new copy to its name (likewise for hierarchies of datasets). But I heard that a pool that is almost full have some performance issues, especially when you try to delete files from that pool. But maybe this becomes a non-issue once the pool is expanded by another vdev. This issue may remain - basically, when a pool is nearly full (YMMV, empirically over 80-90% for pools with many write-delete cycles, but there were reports of even 60% full being a problem), its block allocation may look like good cheese with many tiny holes. Walking the free space to find a hole big enough to write a new block takes time, hence the slowdown. When you expand the pool with a new vdev, the old full cheesy one does not go away, and writes that ZFS pipe line intended to put there would still lag (and may now time out and may get to another vdev, as someone else mentioned in this thread). 
To answer your other letters, > But if I have two raidz3 vdevs, is there any way to create an > isolation/separation between them so that if one of them fails, only the > data that is stored within that vdev will be lost and all data that > happen to be stored in the other can be recovered? And yet let them both > be accessible from the same path? > > The only thing that needs to be sorted out is where the files should go > when you write to that path and avoid splitting such that one half if > the file goes to one vdev and another goes to the other vdev. Maybe > there is some disk or i/o scheduler that can handle such operations? You can't do that. A pool is one whole (you can't also remove vdevs from it and you can't change or reduce raidzN groups' redundancy - may be that will change after the long-awaited BPR = block-pointer rewriter is implemented by some kind samaritan), and as soon as it is set up or expanded all writes go striped to all components and all top-level components are required not-failed to import the pool and use it. > I can't see how a dataset can span over several zpools as you usually > create it with mypool/datasetname (in the case of a file system > dataset). But I can see several datasets in one pool though (e.g. > mypool/dataset1, mypool/dataset2 ...). So the relationship I see is pool > *onto* dataset. It can't. A dataset is contained in one pool. Many datasets can be contained in one pool and share the free space, dedup table and maybe some other resources. Datasets contained in different pools are unrelated. > But if I have two separate pools with separate names, say mypool1 and > mypool2 I could create a zfs file system dataset with the same name in > each of these pools and then give these two datasets the same > "mountpoint" property couldn't I? Then they would be forced to be > mounted to the same path. One at a time - yes. Both at once (in a useful manner) - no. If the mountpoint is not empty, zfs refuses to mount the dataset. 
Even if you force it to (using overlay mount -o), the last mounted dataset's filesystem will be all you'd see. You can however mount other datasets into logical "subdirectories" of the dataset you need to "expand", but those subs must be empty or nonexistent in your currently existing "parent" dataset. Also the new "children" are separate filesystems, so it is your quest to move data into them if you need to free up the existing dataset, and in particular remember that inodes of different filesystems are unrelated, so hardlinks will break for those files that would be forced to split from one inode in the source filesystem to several inodes (i.e. some pathnames in the source FS and some in the child) - like for any other FS boundary crossings. * Can several datasets be mounted to the same mount point, i.e. can multiple "file system"-datasets be mounted so that they (the root of them) are all accessed from exactly the same (POSIX) path and subdirectories with coinciding names will be merged? The purpose of this would be to seamlessly expand storage capacity this way just like when adding vdevs to a pool. What you describe here is known as unionfs in Linux, among others.
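Jim's "children in subdirectories" workaround might look like the following. A hypothetical sketch (pool and dataset names invented): the parent dataset lives in one pool, and a dataset from a second pool is mounted into an empty subdirectory of it:

```shell
# Hypothetical names. /export/media comes from pool1...
zfs create -o mountpoint=/export/media pool1/media
# ...and pool2 contributes what looks like a subdirectory of it.
# 'archive' must not already exist non-empty inside pool1/media.
zfs create -o mountpoint=/export/media/archive pool2/archive
# Hardlinks cannot cross the boundary between the two filesystems:
#   ln /export/media/a /export/media/archive/b   fails with EXDEV
```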
Re: [OpenIndiana-discuss] ZFS; what the manuals don't say ...
> And hardlinks ? For hardlinks, this is bad, indeed, so depending on if you use them or not, this may or may not be a good idea > This is a perfect way to completely trash your > system. There's no need to 'balance' zfs, over time filesystem > writes will balance roughly over the vdevs, only files never > touched again will stay where they are. So don't risk your > system just to get a few bytes/sec more out of it. With current versions of ZFS, writes are balanced ok over free space, unlike earlier. Still, if you want to maximise performance, you want your data to be levelled out on available drives. If we ever get BPR, this will solve this and many other problems… Vennlige hilsener / Best regards roy -- Roy Sigurd Karlsbakk (+47) 98013356 r...@karlsbakk.net http://blogg.karlsbakk.net/ GPG Public key: http://karlsbakk.net/roysigurdkarlsbakk.pubkey.txt -- I all pedagogikk er det essensielt at pensum presenteres intelligibelt. Det er et elementært imperativ for alle pedagoger å unngå eksessiv anvendelse av idiomer med xenotyp etymologi. I de fleste tilfeller eksisterer adekvate og relevante synonymer på norsk. ___ OpenIndiana-discuss mailing list OpenIndiana-discuss@openindiana.org http://openindiana.org/mailman/listinfo/openindiana-discuss
Re: [OpenIndiana-discuss] ZFS; what the manuals don't say ...
On 2012-10-23 17:32, Udo Grabowski (IMK) wrote:
On 23/10/2012 17:18, Roy Sigurd Karlsbakk wrote:

Wouldn't walking the filesystem, making a copy, deleting the original and renaming the copy balance things?

e.g.

#!/bin/sh

LIST=`find /foo -type d`

for I in ${LIST}
do
cp ${I} ${I}.tmp
rm ${I}
mv ${I}.tmp ${I}
done

or perhaps

> And hardlinks ?

This is a perfect way to completely trash your system. There's no need to 'balance' zfs, over time filesystem writes will balance roughly over the vdevs, only files never touched again will stay where they are. So don't risk your system just to get a few bytes/sec more out of it.

That sounds like a good point, unless you first scan for hard links and avoid touching the files and their hard links in the shell script, I guess. But I heard that a pool that is almost full has some performance issues, especially when you try to delete files from that pool. But maybe this becomes a non-issue once the pool is expanded by another vdev.
Re: [OpenIndiana-discuss] ZFS; what the manuals don't say ...
On 2012-10-23 16:22, George Wilson wrote: Comments inline... On 10/23/12 8:29 AM, Robin Axelsson wrote: Hi, I've been using zfs for a while but still there are some questions that have remained unanswered even after reading the documentation so I thought I would ask them here. I have learned that zfs datasets can be expanded by adding vdevs. Say that you have created say a raidz3 pool named "mypool" with the command # zpool create mypool raidz3 disk1 disk2 disk3 ... disk8 you can expand the capacity by adding vdevs to it through the command # zpool add mypool raidz3 disk9 disk10 ... disk16 The vdev that is added doesn't need to have the same raid/mirror configuration or disk geometry, if I understand correctly. It will merely be dynamically concatenated with the old storage pool. The documentations says that it will be "striped" but it is not so clear what that means if data is already stored in the old vdevs of the pool. Unanswered questions: * What determines _where_ the data will be stored on a such a pool? Will it fill up the old vdev(s) before moving on to the new one or will the data be distributed evenly? The data is written in a round-robin fashion across all the top-level vdevs (i.e. the raidz vdevs). So it will get distributed across them as you fill up the pool. It does not fill up one vdev before proceeding. * If the old pool is almost full, an even distribution will be impossible, unless zpool rearranges/relocates data upon adding the vdev. Is that what will happen upon adding a vdev? As you write new data it will try to even out the vdevs. In many cases this is not possible and you may end up with the majority of the writes going to the empty vdevs. There is logic in zfs to avoid certain vdevs if we're unable to allocate from them during a given transaction group commit. So when vdevs are very full you may find that very little data is being written to them. * Can the individual vdevs be read independently/separately? 
If say the newly added vdev faults, will the entire pool be unreadable or will I still be able to access the old data? What if I took a snapshot before adding the new vdev? If you lose a top-level vdev then you probably won't be able to access your old data. If you're lucky you might be able to retrieve some data that was not contained on that top-level vdev but given that ZFS stripes across all vdevs it means that most of your data could be lost. Losing a leaf vdev (i.e. a single disk) within a top-level vdev is a different story. If you lose a leaf vdev then raidz will allow you to continue to use the disk and pool in a degraded state. You can then spare out the failed leaf vdev or replace the disk. * Can several datasets be mounted to the same mount point, i.e. can multiple "file system"-datasets be mounted so that they (the root of them) are all accessed from exactly the same (POSIX) path and subdirectories with coinciding names will be merged? The purpose of this would be to seamlessly expand storage capacity this way just like when adding vdevs to a pool. I think you might be confused about datasets and how they are expanded. Datasets see all the space within a pool. There is not a one-to-one mapping of dataset to pool. So if you want to create 10 datasets and you find that you're running out of space then you simply add another top-level vdev to your pool and all the dataset see the additional space. I pretty certain that doesn't answer your question but maybe it helps in other ways. Feel free to ask again. But if I have two raidz3 vdevs, is there any way to create an isolation/separation between them so that if one of them fails, only the data that is stored within that vdev will be lost and all data that happen to be stored in the other can be recovered? And yet let them both be accessible from the same path? 
The only thing that needs to be sorted out is where the files should go when you write to that path and avoid splitting such that one half of the file goes to one vdev and another goes to the other vdev. Maybe there is some disk or i/o scheduler that can handle such operations? I can't see how a dataset can span over several zpools as you usually create it with mypool/datasetname (in the case of a file system dataset). But I can see several datasets in one pool though (e.g. mypool/dataset1, mypool/dataset2 ...). So the relationship I see is pool *onto* dataset. But if I have two separate pools with separate names, say mypool1 and mypool2 I could create a zfs file system dataset with the same name in each of these pools and then give these two datasets the same "mountpoint" property couldn't I? Then they would be forced to be mounted to the same path. I feel now that the other questions are straightened out. * If that's the case how will the data be distributed/allocated over the datasets if I copy a data file to that path? Data from all datasets are striped across the top-level vdevs.
Re: [OpenIndiana-discuss] ZFS; what the manuals don't say ...
On 23/10/2012 17:18, Roy Sigurd Karlsbakk wrote:

Wouldn't walking the filesystem, making a copy, deleting the original and renaming the copy balance things?

e.g.

#!/bin/sh

LIST=`find /foo -type d`

for I in ${LIST}
do
cp ${I} ${I}.tmp
rm ${I}
mv ${I}.tmp ${I}
done

or perhaps

> And hardlinks ?

This is a perfect way to completely trash your system. There's no need to 'balance' zfs, over time filesystem writes will balance roughly over the vdevs, only files never touched again will stay where they are. So don't risk your system just to get a few bytes/sec more out of it.

--
Dr. Udo Grabowski, Inst. f. Meteorology a. Climate Research, IMK-ASF-SAT
www-imk.fzk.de/asf/sat/grabowski/ www.imk-asf.kit.edu/english/sat.php
KIT - Karlsruhe Institute of Technology http://www.kit.edu
Postfach 3640, 76021 Karlsruhe, Germany T: (+49)721 608-26026 F: -926026
Re: [OpenIndiana-discuss] ZFS; what the manuals don't say ...
Probably should use find -type f to limit to files and also cp -a to maintain permissions and ownership. Not sure if that will maintain ACLs. For the truly paranoid, don't delete the original file so early: rename it, move the temp file back as the original filename, then compare md5 or sha checksums to make sure they are the same, only deleting the original file if the two sums match.

--
Sent from my Jelly Bean Galaxy Nexus

Roy Sigurd Karlsbakk wrote:
>> Wouldn't walking the filesystem, making a copy, deleting the original
>> and renaming the copy balance things?
>>
>> e.g.
>>
>> #!/bin/sh
>>
>> LIST=`find /foo -type d`
>>
>> for I in ${LIST}
>> do
>> cp ${I} ${I}.tmp
>> rm ${I}
>> mv ${I}.tmp ${I}
>> done
>
> or perhaps
>
> # === rewrite.sh ===
> #!/bin/bash
>
> fn=$1
> newfn=$fn.tmp
>
> cp $fn $newfn
> rm -f $fn
> mv $newfn $fn
> # === rewrite.sh ===
>
> find /foo -type f -exec /path/to/rewrite.sh {} \;
>
> Vennlige hilsener / Best regards
>
> roy
> --
> Roy Sigurd Karlsbakk
> (+47) 98013356
> r...@karlsbakk.net
> http://blogg.karlsbakk.net/
> GPG Public key: http://karlsbakk.net/roysigurdkarlsbakk.pubkey.txt
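The checksum-verified variant described above could be sketched like this (a hypothetical helper, not from the thread; it uses `sha1sum`, so on illumos you would swap in `digest -a sha1`; like the original script, it does not preserve inodes, so hardlinks still break):

```shell
#!/bin/sh
# Sketch of the "paranoid" rewrite: keep the original under a new name
# until a checksum proves the copy is byte-identical, then delete it.
safe_rewrite() {
    f=$1
    cp -p "$f" "$f.tmp"        # -p keeps permissions/ownership/times
    mv "$f" "$f.orig"          # keep the original until verified
    mv "$f.tmp" "$f"
    a=$(sha1sum "$f" | awk '{print $1}')
    b=$(sha1sum "$f.orig" | awk '{print $1}')
    if [ "$a" = "$b" ]; then
        rm "$f.orig"           # sums match: safe to drop the original
    else
        mv "$f.orig" "$f"      # mismatch: roll back to the original
        return 1
    fi
}

# Demo on a throwaway file:
d=$(mktemp -d)
printf 'payload' > "$d/file"
safe_rewrite "$d/file" && echo "verified rewrite of $d/file"
```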
Re: [OpenIndiana-discuss] ZFS; what the manuals don't say ...
On 10/23/2012 11:08 AM, Robin Axelsson wrote: On 2012-10-23 15:41, Doug Hughes wrote: On 10/23/2012 8:29 AM, Robin Axelsson wrote: Hi, I've been using zfs for a while but still there are some questions that have remained unanswered even after reading the documentation so I thought I would ask them here. I have learned that zfs datasets can be expanded by adding vdevs. Say that you have created say a raidz3 pool named "mypool" with the command # zpool create mypool raidz3 disk1 disk2 disk3 ... disk8 you can expand the capacity by adding vdevs to it through the command # zpool add mypool raidz3 disk9 disk10 ... disk16 The vdev that is added doesn't need to have the same raid/mirror configuration or disk geometry, if I understand correctly. It will merely be dynamically concatenated with the old storage pool. The documentations says that it will be "striped" but it is not so clear what that means if data is already stored in the old vdevs of the pool. Unanswered questions: * What determines _where_ the data will be stored on a such a pool? Will it fill up the old vdev(s) before moving on to the new one or will the data be distributed evenly? * If the old pool is almost full, an even distribution will be impossible, unless zpool rearranges/relocates data upon adding the vdev. Is that what will happen upon adding a vdev? * Can the individual vdevs be read independently/separately? If say the newly added vdev faults, will the entire pool be unreadable or will I still be able to access the old data? What if I took a snapshot before adding the new vdev? * Can several datasets be mounted to the same mount point, i.e. can multiple "file system"-datasets be mounted so that they (the root of them) are all accessed from exactly the same (POSIX) path and subdirectories with coinciding names will be merged? The purpose of this would be to seamlessly expand storage capacity this way just like when adding vdevs to a pool. 
* If that's the case how will the data be distributed/allocated over the datasets if I copy a data file to that path? Kind regards Robin. *) yes, you can dynamically add more disks and zfs will just start using them. *) zfs stripes across all vdevs evenly, as it can. *) as your old vdev gets full, zfs will only allocate blocks to the newer, less full vdev *) since it's a stripe across vdevs (and they should all be raidz2 or better!) if one vdev fails, your filesystem will be unavailable. They are not independent unless you put them in a separate pool. *) you cannot have overlapping /mixed filesystems at exactly the same place, however it is perfectly possible to have e.g. /export be on rootpool, /export/mystuff on zpool1 and /export/mystuff/morestuff be on zpool2. The unasked question is "If I wanted the vdevs to be equally balanced, could I?". The answers is a qualified yes. What you would need to do is reopen every single file, buffer it to memory, then write every block out again. We did this operation once. It means that all vdevs will roughly have the same block allocation when you are done. Do you happen to know how that's done in OI? Otherwise I would have to move each file one by one to a disk location outside the dataset and then move it back or zfs send the dataset to another pool of at least equal size and then zfs receive it back to the expanded pool. you don't have to move it, you just have to open, read it into memory, seek back to the beginning, and write it out again. Rewriting those blocks will take care of it since ZFS is copy-on-write. You will need to be wary of your snapshots during this process since all files will be rewritten and you'll double your space consumption. (basically a perl, python, or other similar script could do this) ___ OpenIndiana-discuss mailing list OpenIndiana-discuss@openindiana.org http://openindiana.org/mailman/listinfo/openindiana-discuss
Re: [OpenIndiana-discuss] ZFS; what the manuals don't say ...
> Wouldn't walking the filesystem, making a copy, deleting the original
> and renaming the copy balance things?
>
> e.g.
>
> #!/bin/sh
>
> LIST=`find /foo -type d`
>
> for I in ${LIST}
> do
> cp ${I} ${I}.tmp
> rm ${I}
> mv ${I}.tmp ${I}
> done

or perhaps

# === rewrite.sh ===
#!/bin/bash

fn=$1
newfn=$fn.tmp

cp $fn $newfn
rm -f $fn
mv $newfn $fn
# === rewrite.sh ===

find /foo -type f -exec /path/to/rewrite.sh {} \;

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 98013356
r...@karlsbakk.net
http://blogg.karlsbakk.net/
GPG Public key: http://karlsbakk.net/roysigurdkarlsbakk.pubkey.txt
Re: [OpenIndiana-discuss] ZFS; what the manuals don't say ...
> Do you happen to know how that's done in OI? Otherwise I would have to
> move each file one by one to a disk location outside the dataset and
> then move it back or zfs send the dataset to another pool of at least
> equal size and then zfs receive it back to the expanded pool.

Unless something was added recently, this isn't something ZFS can do itself. You can, however, do a "find /dataset -type f -exec rewrite.sh {} \;" or something similar

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 98013356
r...@karlsbakk.net
http://blogg.karlsbakk.net/
GPG Public key: http://karlsbakk.net/roysigurdkarlsbakk.pubkey.txt
Re: [OpenIndiana-discuss] ZFS; what the manuals don't say ...
--- On Tue, 10/23/12, Doug Hughes wrote:

[snip]

> The unasked question is "If I wanted the vdevs to be equally
> balanced, could I?". The answer is a qualified yes. What
> you would need to do is reopen every single file, buffer it
> to memory, then write every block out again. We did this
> operation once. It means that all vdevs will roughly have
> the same block allocation when you are done.

Wouldn't walking the filesystem, making a copy, deleting the original and renaming the copy balance things?

e.g.

#!/bin/sh

LIST=`find /foo -type d`

for I in ${LIST}
do
cp ${I} ${I}.tmp
rm ${I}
mv ${I}.tmp ${I}
done
Re: [OpenIndiana-discuss] ZFS; what the manuals don't say ...
On 2012-10-23 15:41, Doug Hughes wrote: On 10/23/2012 8:29 AM, Robin Axelsson wrote: Hi, I've been using zfs for a while but still there are some questions that have remained unanswered even after reading the documentation so I thought I would ask them here. I have learned that zfs datasets can be expanded by adding vdevs. Say that you have created say a raidz3 pool named "mypool" with the command # zpool create mypool raidz3 disk1 disk2 disk3 ... disk8 you can expand the capacity by adding vdevs to it through the command # zpool add mypool raidz3 disk9 disk10 ... disk16 The vdev that is added doesn't need to have the same raid/mirror configuration or disk geometry, if I understand correctly. It will merely be dynamically concatenated with the old storage pool. The documentations says that it will be "striped" but it is not so clear what that means if data is already stored in the old vdevs of the pool. Unanswered questions: * What determines _where_ the data will be stored on a such a pool? Will it fill up the old vdev(s) before moving on to the new one or will the data be distributed evenly? * If the old pool is almost full, an even distribution will be impossible, unless zpool rearranges/relocates data upon adding the vdev. Is that what will happen upon adding a vdev? * Can the individual vdevs be read independently/separately? If say the newly added vdev faults, will the entire pool be unreadable or will I still be able to access the old data? What if I took a snapshot before adding the new vdev? * Can several datasets be mounted to the same mount point, i.e. can multiple "file system"-datasets be mounted so that they (the root of them) are all accessed from exactly the same (POSIX) path and subdirectories with coinciding names will be merged? The purpose of this would be to seamlessly expand storage capacity this way just like when adding vdevs to a pool. 
* If that's the case how will the data be distributed/allocated over the datasets if I copy a data file to that path? Kind regards Robin. *) yes, you can dynamically add more disks and zfs will just start using them. *) zfs stripes across all vdevs evenly, as it can. *) as your old vdev gets full, zfs will only allocate blocks to the newer, less full vdev *) since it's a stripe across vdevs (and they should all be raidz2 or better!) if one vdev fails, your filesystem will be unavailable. They are not independent unless you put them in a separate pool. *) you cannot have overlapping /mixed filesystems at exactly the same place, however it is perfectly possible to have e.g. /export be on rootpool, /export/mystuff on zpool1 and /export/mystuff/morestuff be on zpool2. The unasked question is "If I wanted the vdevs to be equally balanced, could I?". The answers is a qualified yes. What you would need to do is reopen every single file, buffer it to memory, then write every block out again. We did this operation once. It means that all vdevs will roughly have the same block allocation when you are done. Do you happen to know how that's done in OI? Otherwise I would have to move each file one by one to a disk location outside the dataset and then move it back or zfs send the dataset to another pool of at least equal size and then zfs receive it back to the expanded pool. ___ OpenIndiana-discuss mailing list OpenIndiana-discuss@openindiana.org http://openindiana.org/mailman/listinfo/openindiana-discuss ___ OpenIndiana-discuss mailing list OpenIndiana-discuss@openindiana.org http://openindiana.org/mailman/listinfo/openindiana-discuss
Re: [OpenIndiana-discuss] ZFS; what the manuals don't say ...
Comments inline... On 10/23/12 8:29 AM, Robin Axelsson wrote: Hi, I've been using zfs for a while but still there are some questions that have remained unanswered even after reading the documentation so I thought I would ask them here. I have learned that zfs datasets can be expanded by adding vdevs. Say that you have created say a raidz3 pool named "mypool" with the command # zpool create mypool raidz3 disk1 disk2 disk3 ... disk8 you can expand the capacity by adding vdevs to it through the command # zpool add mypool raidz3 disk9 disk10 ... disk16 The vdev that is added doesn't need to have the same raid/mirror configuration or disk geometry, if I understand correctly. It will merely be dynamically concatenated with the old storage pool. The documentations says that it will be "striped" but it is not so clear what that means if data is already stored in the old vdevs of the pool. Unanswered questions: * What determines _where_ the data will be stored on a such a pool? Will it fill up the old vdev(s) before moving on to the new one or will the data be distributed evenly? The data is written in a round-robin fashion across all the top-level vdevs (i.e. the raidz vdevs). So it will get distributed across them as you fill up the pool. It does not fill up one vdev before proceeding. * If the old pool is almost full, an even distribution will be impossible, unless zpool rearranges/relocates data upon adding the vdev. Is that what will happen upon adding a vdev? As you write new data it will try to even out the vdevs. In many cases this is not possible and you may end up with the majority of the writes going to the empty vdevs. There is logic in zfs to avoid certain vdevs if we're unable to allocate from them during a given transaction group commit. So when vdevs are very full you may find that very little data is being written to them. * Can the individual vdevs be read independently/separately? 
> If say the newly added vdev faults, will the entire pool be unreadable or will I still be able to access the old data? What if I took a snapshot before adding the new vdev?

If you lose a top-level vdev then you probably won't be able to access your old data. If you're lucky you might be able to retrieve some data that was not contained on that top-level vdev, but given that ZFS stripes across all vdevs it means that most of your data could be lost.

Losing a leaf vdev (i.e. a single disk) within a top-level vdev is a different story. If you lose a leaf vdev then raidz will allow you to continue to use the pool in a degraded state. You can then spare out the failed leaf vdev or replace the disk.

> * Can several datasets be mounted to the same mount point, i.e. can multiple "file system"-datasets be mounted so that they (the root of them) are all accessed from exactly the same (POSIX) path and subdirectories with coinciding names will be merged? The purpose of this would be to seamlessly expand storage capacity this way just like when adding vdevs to a pool.

I think you might be confused about datasets and how they are expanded. Datasets see all the space within a pool. There is not a one-to-one mapping of dataset to pool. So if you want to create 10 datasets and you find that you're running out of space, you simply add another top-level vdev to your pool and all the datasets see the additional space. I'm pretty certain that doesn't answer your question but maybe it helps in other ways. Feel free to ask again.

> * If that's the case how will the data be distributed/allocated over the datasets if I copy a data file to that path?

Data from all datasets is striped across the top-level vdevs. The notion of a given dataset only writing to a single raidz device in the pool does not exist.

Thanks,
George

> Kind regards
> Robin.
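The round-robin behavior George describes can be observed directly. A hedged sketch, assuming a pool named mypool and hypothetical device names (`zpool iostat -v` reports allocation and I/O statistics broken down per top-level vdev):

```shell
# Illustrative only: after adding a second raidz3 vdev, watch
# per-vdev statistics to see new writes favor the emptier vdev.
zpool add mypool raidz3 c3t0d0 c3t1d0 c3t2d0 c3t3d0 \
                        c3t4d0 c3t5d0 c3t6d0 c3t7d0
zpool iostat -v mypool 5
# The 'alloc' column for the original raidz3-0 vdev holds the
# existing data, while most new writes land on raidz3-1 until
# the two vdevs even out.
```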
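George's point that every dataset sees the whole pool's free space can be sketched as follows (hypothetical dataset names, not commands from the thread):

```shell
# Two datasets in one pool draw from the same free space.
zfs create mypool/photos
zfs create mypool/music
zfs list -o name,used,avail mypool/photos mypool/music
# Both rows report the same AVAIL: the pool's free space.

# Growing the pool grows AVAIL for every dataset at once;
# there is no per-dataset resize step.
zpool add mypool raidz3 disk9 disk10 disk11 disk12 disk13 disk14 disk15 disk16
zfs list -o name,used,avail mypool/photos mypool/music
```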
Re: [OpenIndiana-discuss] ZFS; what the manuals don't say ...
On 10/23/2012 8:29 AM, Robin Axelsson wrote:
> Hi, I've been using zfs for a while but still there are some questions that have remained unanswered even after reading the documentation, so I thought I would ask them here.
>
> I have learned that zfs datasets can be expanded by adding vdevs. Say that you have created a raidz3 pool named "mypool" with the command
>
> # zpool create mypool raidz3 disk1 disk2 disk3 ... disk8
>
> you can expand the capacity by adding vdevs to it through the command
>
> # zpool add mypool raidz3 disk9 disk10 ... disk16
>
> The vdev that is added doesn't need to have the same raid/mirror configuration or disk geometry, if I understand correctly. It will merely be dynamically concatenated with the old storage pool. The documentation says that it will be "striped" but it is not so clear what that means if data is already stored in the old vdevs of the pool.
>
> Unanswered questions:
>
> * What determines _where_ the data will be stored on such a pool? Will it fill up the old vdev(s) before moving on to the new one or will the data be distributed evenly?
>
> * If the old pool is almost full, an even distribution will be impossible, unless zpool rearranges/relocates data upon adding the vdev. Is that what will happen upon adding a vdev?
>
> * Can the individual vdevs be read independently/separately? If say the newly added vdev faults, will the entire pool be unreadable or will I still be able to access the old data? What if I took a snapshot before adding the new vdev?
>
> * Can several datasets be mounted to the same mount point, i.e. can multiple "file system"-datasets be mounted so that they (the root of them) are all accessed from exactly the same (POSIX) path and subdirectories with coinciding names will be merged? The purpose of this would be to seamlessly expand storage capacity this way just like when adding vdevs to a pool.
>
> * If that's the case how will the data be distributed/allocated over the datasets if I copy a data file to that path?
>
> Kind regards
> Robin.
*) yes, you can dynamically add more disks and zfs will just start using them.
*) zfs stripes across all vdevs as evenly as it can.
*) as your old vdev gets full, zfs will only allocate blocks to the newer, less full vdev.
*) since it's a stripe across vdevs (and they should all be raidz2 or better!), if one vdev fails, your filesystem will be unavailable. They are not independent unless you put them in a separate pool.
*) you cannot have overlapping/mixed filesystems at exactly the same place; however, it is perfectly possible to have e.g. /export be on rootpool, /export/mystuff on zpool1 and /export/mystuff/morestuff be on zpool2.

The unasked question is "If I wanted the vdevs to be equally balanced, could I?" The answer is a qualified yes. What you would need to do is reopen every single file, buffer it to memory, then write every block out again. We did this operation once. It means that all vdevs will have roughly the same block allocation when you are done.
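The /export layering in the last bullet can be spelled out with the `mountpoint` property. A hedged sketch with hypothetical pool and dataset names:

```shell
# Filesystems from different pools cannot be merged at one path,
# but they can be layered along a directory hierarchy.
zfs set mountpoint=/export rpool/export
zfs set mountpoint=/export/mystuff zpool1/mystuff
zfs set mountpoint=/export/mystuff/morestuff zpool2/morestuff
# A file written under /export/mystuff/morestuff lands in zpool2;
# one written directly under /export/mystuff lands in zpool1.
# Each file lives entirely in the dataset owning the deepest
# mountpoint above it -- there is no merging of contents.
```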