Re: ZFS regimen: scrub, scrub, scrub and scrub again.
>> here is my real world production example of users mail as well as
>> documents.
>>
>> /dev/mirror/home1.eli  2788  1545  1243  55%  1941057  20981181  8%  /home
>
> Not the same data, I imagine.

A mix. 90% mailboxes and user data (documents, pictures); the rest are some .tar.gz backups. At other places I have a similar situation: one or more gmirror sets, 1-3TB each, depending on the drives. For those who put 1000s of mailboxes on these, I recommend dovecot with the mdbox storage backend.

> I was dealing with the actual byte counts ... that figure is going to be
> in whole blocks.
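For anyone curious what that switch looks like, it is a one-line change in Dovecot 2.x; a minimal sketch, with the config path and mdbox directory being the usual defaults rather than anything from this setup:

  # /usr/local/etc/dovecot/dovecot.conf
  # mdbox packs many messages per file, so a mailbox stops being
  # one-file-per-message and the small-file overhead discussed in
  # this thread mostly disappears:
  mail_location = mdbox:~/mdbox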
Re: ZFS regimen: scrub, scrub, scrub and scrub again.
On 2013-01-23 21:22, Wojciech Puchar wrote:
>>> While RAID-Z is already a king of bad performance,
>> I don't believe RAID-Z is any worse than RAID5. Do you have any actual
>> measurements to back up your claim?
> it is clearly described even in ZFS papers. Both on reads and writes it
> gives single drive random I/O performance.

With ZFS and RAID-Z the situation is a bit more complex. Let's assume a 5-disk raidz1 vdev with ashift=9 (512-byte sectors).

A worst case could be a random I/O workload reading random files of 2048 bytes each. Each file read would require data from 4 disks (the 5th holds parity and won't be read unless there are errors). However, if the files were 512 bytes or less, only one disk would be used; 1024 bytes, two disks, etc. So ZFS is probably not the best choice for storing millions of small files if random access to whole files is the primary concern.

But let's look at a different scenario: a PostgreSQL database. Here table data is split and stored in 1GB files. ZFS splits each file into 128KiB records (the recordsize property). Each record is then split again into 4 columns of 32768 bytes each, and a 5th column is generated containing parity. Each column is then stored on a different disk. You could think of it as a regular RAID-5 with a stripe size of 32768 bytes. PostgreSQL uses 8192-byte pages, which fit evenly into both the ZFS record size and the column size, so each page access requires only a single disk read. Random I/O performance here should be 5 times that of a single disk.

For me the reliability ZFS offers is far more important than pure performance.
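If you want to check these numbers on a live system, the knobs are easy to inspect; a sketch, where "tank/pgdata" is a hypothetical dataset name:

  zdb | grep ashift                  # 9 = 512-byte sectors per vdev
  zfs get recordsize tank/pgdata     # 128K by default
  # the column arithmetic from above: one 128KiB record over a 5-disk
  # raidz1 is 4 data columns plus 1 parity column:
  echo $((131072 / 4))               # 32768 bytes of data per disk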
Re: ZFS regimen: scrub, scrub, scrub and scrub again.
> then stored on a different disk. You could think of it as a regular
> RAID-5 with stripe size of 32768 bytes. PostgreSQL uses 8192 byte pages
> that fit evenly both into ZFS record size and column size. Each page
> access requires only a single disk read. Random i/o performance here
> should be 5 times that of a single disk.

Think about writing 8192-byte pages randomly, and then doing a linear search over the table.

> For me the reliability ZFS offers is far more important than pure
> performance.

Except it is on-paper reliability.
Re: ZFS regimen: scrub, scrub, scrub and scrub again.
Wow! OK. It sounds like you (or someone like you) can answer some of my burning questions about ZFS.

On Thu, Jan 24, 2013 at 8:12 AM, Adam Nowacki nowa...@platinum.linux.pl wrote:
> Lets assume 5 disk raidz1 vdev with ashift=9 (512 byte sectors). A worst
> case scenario could happen if your random i/o workload was reading random
> files each of 2048 bytes. Each file read would require data from 4 disks
> (5th is parity and won't be read unless there are errors). However if
> files were 512 bytes or less then only one disk would be used. 1024 bytes
> - two disks, etc. So ZFS is probably not the best choice to store
> millions of small files if random access to whole files is the primary
> concern.
>
> But lets look at a different scenario - a PostgreSQL database. Here
> table data is split and stored in 1GB files. ZFS splits the file into
> 128KiB records (recordsize property). This record is then again split
> into 4 columns each 32768 bytes. 5th column is generated containing
> parity. Each column is then stored on a different disk. You could think
> of it as a regular RAID-5 with stripe size of 32768 bytes.

Ok... so my question then would be: what of the small files? If I write several small files at once, does the transaction use a record, or does each file need to use a record? Additionally, if small files use sub-records, when you delete that file, does the sub-record get moved or just wasted (until the record is completely free)?

I'm considering the difference, say, between Cyrus imap (one file per message on ZFS, database files on a different ZFS filesystem) and dbmail imap (PostgreSQL on ZFS). Now, I realize that PostgreSQL on ZFS has some special issues (but I don't have a choice here between ZFS and non-ZFS ... ZFS has already been chosen), but I'm also figuring that PostgreSQL on ZFS has some waste compared to Cyrus IMAP on ZFS.

So far in my research, Cyrus makes some compelling arguments that the common use case for most IMAP database files is a full scan, for which its database files are optimized and SQL-based files are not. I agree that some operations can be more efficient in a good SQL database, but a full scan (as the most often used query) is not. Cyrus also makes sense to me as a collection of small files ... for which I expect ZFS to excel, including the ability to snapshot with impunity... but I am terribly curious how the files are handled in transactions.

I'm actually (right now) running some file size statistics (and I'll get back to the list, if asked), but I'd like to know how ZFS is going to store the arriving mail... :)
Re: ZFS regimen: scrub, scrub, scrub and scrub again.
> several small files at once, does the transaction use a record, or does
> each file need to use a record? Additionally, if small files use
> sub-records, when you delete that file, does the sub-record get moved or
> just wasted (until the record is completely free)?

Writes of small files are always good with ZFS.
Re: ZFS regimen: scrub, scrub, scrub and scrub again.
On 2013-01-24 15:24, Wojciech Puchar wrote:
>> For me the reliability ZFS offers is far more important than pure
>> performance.
> Except it is on paper reliability.

This on-paper reliability in practice saved a 20TB pool; see one of my previous emails. Any other filesystem, or hardware/software RAID without per-disk checksums, would have failed. Silent corruption of unimportant files would be the best case; complete filesystem death from corruption of important metadata, the worst.

I've been using ZFS for 3 years on many systems. The biggest one has 44 disks and 4 ZFS pools; this one survived SAS expander disconnects, a few kernel panics and countless power failures (the UPS only holds for a few hours). So far I've not lost a single ZFS pool or any data stored.
Re: ZFS regimen: scrub, scrub, scrub and scrub again.
On 2013-01-24 15:45, Zaphod Beeblebrox wrote:
> Ok... so my question then would be... what of the small files. If I
> write several small files at once, does the transaction use a record, or
> does each file need to use a record? Additionally, if small files use
> sub-records, when you delete that file, does the sub-record get moved or
> just wasted (until the record is completely free)?

Each file is a fully self-contained object (together with full parity) all the way down to the physical storage. A 1-byte file on a RAID-Z2 pool will always use 3 disks, 3 sectors total, for the data alone. You can use du to verify: it reports physical size together with parity. Metadata like directory entries or file attributes is stored separately and shared with other files. So for small files there may be a lot of wasted space.
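This is easy to see for yourself; a sketch assuming a raidz2 dataset mounted at /tank with ashift=9:

  echo -n x > /tank/tiny      # a 1-byte file
  sync; sleep 5               # wait for the transaction group to commit
  du /tank/tiny               # ~3 512-byte blocks: 1 data + 2 parity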
Re: ZFS regimen: scrub, scrub, scrub and scrub again.
Ok... here's the existing data: there are 3,236,316 files summing to 97,500,008,691 bytes. That puts the average file at 30,127 bytes. But for the full breakdown (bucket upper bound in bytes : file count):

       512 : 7758
      1024 : 139046
      2048 : 1468904
      4096 : 325375
      8192 : 492399
     16384 : 324728
     32768 : 263210
     65536 : 102407
    131072 : 43046
    262144 : 22259
    524288 : 17136
   1048576 : 13788
   2097152 : 8279
   4194304 : 4501
   8388608 : 2317
  16777216 : 1045
  33554432 : 119
  67108864 : 2

I produced that list from the byte counts in ls -R's output, sorted numerically and then processed with:

  (size=512; count=0
   while read num; do
     count=$((count+1))
     if [ "$num" -gt "$size" ]; then
       echo "$size : $count"; size=$((size*2)); count=0
     fi
   done) < imapfilesizelist

... Now the new machine has two 2T disks in a ZFS mirror --- so I suppose it won't waste as much space as a RAID-Z ZFS --- in that files less than 512 bytes will take 512 bytes? By far the most common case is 2048 bytes ... so that would indicate that a RAID-Z larger than 5 disks would waste much space. Does that go to your recommendations on vdev size, then? To have an 8- or 9-disk vdev, you should be storing 4k files at the smallest?
Re: ZFS regimen: scrub, scrub, scrub and scrub again.
> So far I've not lost a single ZFS pool or any data stored.

So far my house hasn't been robbed.
Re: ZFS regimen: scrub, scrub, scrub and scrub again.
> There are 3,236,316 files summing to 97,500,008,691 bytes. That puts
> the average file at 30,127 bytes. But for the full breakdown:

Quite low. What do you store? Here is my real-world production example of users' mail as well as documents:

  /dev/mirror/home1.eli  2788  1545  1243  55%  1941057  20981181  8%  /home
Re: ZFS regimen: scrub, scrub, scrub and scrub again.
On Thu, Jan 24, 2013 at 2:26 PM, Wojciech Puchar
woj...@wojtek.tensor.gdynia.pl wrote:
>> There are 3,236,316 files summing to 97,500,008,691 bytes. That puts
>> the average file at 30,127 bytes. But for the full breakdown:
> quite low. what do you store.

Apparently you're not really following this thread... just trolling? I had said that it was Cyrus IMAP data (which, for reference, is one file per email message).

> here is my real world production example of users mail as well as
> documents.
>
> /dev/mirror/home1.eli  2788  1545  1243  55%  1941057  20981181  8%  /home

Not the same data, I imagine. I was dealing with the actual byte counts ... that figure is going to be in whole blocks.
Re: ZFS regimen: scrub, scrub, scrub and scrub again.
On Jan 24, 2013, at 4:24 PM, Wojciech Puchar
woj...@wojtek.tensor.gdynia.pl wrote:
> Except it is on paper reliability.

This on-paper reliability has saved my ass numerous times. For example, I had one home NAS machine with a flaky SATA controller that, from time to time, would fail to detect one of the four drives on reboot. This left my pool degraded several times, and even rebooting from, let's say, "disk4 failed" into a state where disk3 is failed did not corrupt any data. I don't think this is possible with any other open source FS, let alone hardware RAID, which would drop the whole array because of this.

I have never personally lost any data on ZFS. Yes, performance is another topic, and you must know what you are doing and what your usage pattern is, but from a reliability standpoint ZFS looks more durable to me than anything else.

P.S.: My home NAS is running FreeBSD -CURRENT with ZFS from the first version available. Several drives have died, and twice the pool was expanded by replacing all drives one by one and resilvering; not a single byte was lost.
Re: ZFS regimen: scrub, scrub, scrub and scrub again.
>> While RAID-Z is already a king of bad performance,
> I don't believe RAID-Z is any worse than RAID5. Do you have any actual
> measurements to back up your claim?

It is clearly described even in the ZFS papers. Both on reads and writes it gives single-drive random I/O performance.
Re: ZFS regimen: scrub, scrub, scrub and scrub again.
> This is because RAID-Z spreads each block out over all disks, whereas
> RAID5 (as it is typically configured) puts each block on only one disk.
> So to read a block from RAID-Z, all data disks must be involved, vs. for
> RAID5 only one disk needs to have its head moved. For other workloads
> (especially streaming reads/writes), there is no fundamental difference,
> though of course implementation quality may vary.

A streaming workload is generally always good. Random I/O is what is important.
Re: ZFS regimen: scrub, scrub, scrub and scrub again.
On 23 Jan 2013 20:23, Wojciech Puchar woj...@wojtek.tensor.gdynia.pl wrote:
>>> While RAID-Z is already a king of bad performance,
>> I don't believe RAID-Z is any worse than RAID5. Do you have any actual
>> measurements to back up your claim?
> it is clearly described even in ZFS papers. Both on reads and writes it
> gives single drive random I/O performance.

So we have to take your word for it? Provide a link if you're going to make assertions, or they're no more than your own opinion.

Chris
Re: ZFS regimen: scrub, scrub, scrub and scrub again.
On Wed, 23 Jan 2013 14:26:43 -0600, Chris Rees utis...@gmail.com wrote:
> So we have to take your word for it? Provide a link if you're going to
> make assertions, or they're no more than your own opinion.

I've heard this same thing -- every vdev == 1 drive in performance. I've never seen any proof/papers on it though.
Re: ZFS regimen: scrub, scrub, scrub and scrub again.
On Wed, Jan 23, 2013 at 12:22 PM, Wojciech Puchar
woj...@wojtek.tensor.gdynia.pl wrote:
>>> While RAID-Z is already a king of bad performance,
>> I don't believe RAID-Z is any worse than RAID5. Do you have any actual
>> measurements to back up your claim?
> it is clearly described even in ZFS papers. Both on reads and writes it
> gives single drive random I/O performance.

For reads - true. For writes it probably behaves better than RAID5, as it does not have to go through a read-modify-write cycle for partial block updates. Search for "RAID-5 write hole". If you need higher performance, build your pool out of multiple RAID-Z vdevs.
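For example (the device names below are placeholders), two raidz1 vdevs in one pool give you two vdevs' worth of random IOPS, at the cost of a second disk of parity:

  zpool create tank \
      raidz da0 da1 da2 da3 da4 \
      raidz da5 da6 da7 da8 da9
  zpool status tank    # shows both vdevs; ZFS stripes across them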
Re: ZFS regimen: scrub, scrub, scrub and scrub again.
On Wed, Jan 23, 2013 at 1:09 PM, Mark Felder f...@feld.me wrote:
> I've heard this same thing -- every vdev == 1 drive in performance. I've
> never seen any proof/papers on it though.

"1 drive in performance" only applies to the number of random I/O operations a vdev can perform. You still get increased throughput. I.e. a 5-drive RAIDZ will have 4x the bandwidth of the individual disks in the vdev, but it will deliver only as many IOPS as the slowest drive, since each record has to be read back from N-1 or N-2 drives in the vdev. It's the same for RAID5. IMHO, for an identical record/block size, RAID5 has no advantage over RAID-Z for reads, and it does have a disadvantage when it comes to small writes. Never mind the lack of data integrity checks and the other bells and whistles ZFS provides.

--Artem
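As a back-of-envelope version of that model (the per-disk numbers are assumptions, not measurements):

  disks=5 iops=200 mbs=150   # one raidz1 vdev of ordinary spindles
  echo "random-read IOPS: $iops"                  # gated by one spindle
  echo "streaming MB/s:   $(( (disks-1) * mbs ))" # N-1 data disks
  vdevs=4
  echo "pool of $vdevs such vdevs: $(( vdevs * iops )) IOPS"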
Re: ZFS regimen: scrub, scrub, scrub and scrub again.
> I've heard this same thing -- every vdev == 1 drive in performance.
> I've never seen any proof/papers on it though.

Read the original ZFS papers.
Re: ZFS regimen: scrub, scrub, scrub and scrub again.
>> gives single drive random I/O performance.
> For reads - true. For writes it's probably behaves better than RAID5

Yes, because as with reads it gives single-drive performance. Small writes on RAID5 give lower than single-disk performance.

> If you need higher performance, build your pool out of multiple RAID-Z
> vdevs.

Even if you need just normal performance, use gmirror and UFS.
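The setup being recommended here is only a few commands; a sketch with placeholder device names:

  gmirror load                       # or geom_mirror_load="YES" in loader.conf
  gmirror label -v home ada1 ada2    # two-disk RAID-1
  newfs -U /dev/mirror/home          # UFS2 with soft updates
  mount /dev/mirror/home /home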
Re: ZFS regimen: scrub, scrub, scrub and scrub again.
On 23 January 2013 21:24, Wojciech Puchar woj...@wojtek.tensor.gdynia.pl wrote:
>> I've heard this same thing -- every vdev == 1 drive in performance.
>> I've never seen any proof/papers on it though.
> read original ZFS papers.

No, you are making the assertion; provide a link.

Chris
Re: ZFS regimen: scrub, scrub, scrub and scrub again.
> 1 drive in performance only applies to number of random i/o operations
> vdev can perform. You still get increased throughput. I.e. 5-drive RAIDZ
> will have 4x bandwidth of individual disks in vdev, but

Unless your work is serving movies, it doesn't matter.
Re: ZFS regimen: scrub, scrub, scrub and scrub again.
On Wed, 23 Jan 2013 14:26:43 -0600, Chris Rees utis...@gmail.com wrote:
>> So we have to take your word for it? Provide a link if you're going to
>> make assertions, or they're no more than your own opinion.
> I've heard this same thing -- every vdev == 1 drive in performance. I've
> never seen any proof/papers on it though.

First Google answer for the request "raids performance":

https://blogs.oracle.com/roch/entry/when_to_and_not_to

"Effectively, as a first approximation, an N-disk RAID-Z group will behave as a single device in terms of delivered random input IOPS. Thus a 10-disk group of devices each capable of 200 IOPS will globally act as a 200-IOPS-capable RAID-Z group. This is the price to pay to achieve proper data protection without the 2X block overhead associated with mirroring."

-- Michel Talon ta...@lpthe.jussieu.fr
Re: ZFS regimen: scrub, scrub, scrub and scrub again.
On Wed, Jan 23, 2013 at 1:25 PM, Wojciech Puchar
woj...@wojtek.tensor.gdynia.pl wrote:
> yes, because as with reads it gives single drive performance. small
> writes on RAID5 gives lower than single disk performance.
>> If you need higher performance, build your pool out of multiple RAID-Z
>> vdevs.
> even you need normal performance use gmirror and UFS

I've no objection. If it works for you -- go for it. For me personally, ZFS performance is good enough, and data integrity verification is something I'm willing to sacrifice some performance for. A ZFS scrub gives me either the warm and fuzzy feeling that everything is OK, or explicitly tells me that something bad happened *and* reconstructs the data if that's possible.

Just my $0.02,
--Artem
Re: ZFS regimen: scrub, scrub, scrub and scrub again.
On Jan 23, 2013, at 11:09 PM, Mark Felder f...@feld.me wrote:
> On Wed, 23 Jan 2013 14:26:43 -0600, Chris Rees utis...@gmail.com wrote:
>> So we have to take your word for it? Provide a link if you're going to
>> make assertions, or they're no more than your own opinion.
> I've heard this same thing -- every vdev == 1 drive in performance. I've
> never seen any proof/papers on it though.

Here is a blog post that describes why this is true for IOPS:

http://constantin.glez.de/blog/2010/04/ten-ways-easily-improve-oracle-solaris-zfs-filesystem-performance
Re: ZFS regimen: scrub, scrub, scrub and scrub again.
On 23 Jan 2013 21:45, Michel Talon ta...@lpthe.jussieu.fr wrote:
> first google answer from request raids performance
>
> https://blogs.oracle.com/roch/entry/when_to_and_not_to
>
> "Effectively, as a first approximation, an N-disk RAID-Z group will
> behave as a single device in terms of delivered random input IOPS. Thus
> a 10-disk group of devices each capable of 200 IOPS will globally act as
> a 200-IOPS-capable RAID-Z group. This is the price to pay to achieve
> proper data protection without the 2X block overhead associated with
> mirroring."

Thanks for the link, but I could have done that; I am attempting to explain to Wojciech that his habit of making bold assertions and arrogantly refusing to back them up makes for frustrating reading.

Chris
Re: ZFS regimen: scrub, scrub, scrub and scrub again.
>> associated with mirroring.
> Thanks for the link, but I could have done that; I am attempting to
> explain to Wojciech that his habit of making bold assertions and

As you can see, it is not a bold assertion; you just use something without even reading its docs, never mind doing any more research.
Re: ZFS regimen: scrub, scrub, scrub and scrub again.
>> even you need normal performance use gmirror and UFS
> I've no objection. If it works for you -- go for it.

Both work. For today's trend of solving everything with more hardware, ZFS may even have enough performance. But it is still dangerous for the reasons I explained, and it promotes bad setups and layouts, like making a single filesystem out of a large number of disks. That is bad no matter what filesystem and RAID setup you use, or even what OS.
Re: ZFS regimen: scrub, scrub, scrub and scrub again.
On 01/23/13 14:27, Wojciech Puchar wrote:
> Both work. For today's trend of solving everything with more hardware,
> ZFS may even have enough performance. But it is still dangerous for the
> reasons I explained, and it promotes bad setups and layouts, like making
> a single filesystem out of a large number of disks.

ZFS mirror performance is quite good (both random I/O and sequential), and resilvers/scrubs are measured in an hour or less. You can always make a pool out of mirrors instead of RAID-Z if you can get away with less total available space. I think RAID-Z vs. gmirror is a bad comparison; you can use a ZFS mirror with all the ZFS features, plus N-way mirroring (not sure if gmirror does this).

Regarding single large filesystems, there is an old saying about not putting all your eggs into one basket, even if it's a great basket :)

Matt
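For comparison with the gmirror setups elsewhere in the thread, a mirrored pool and an N-way mirror look like this (placeholder devices):

  # stripe of two 2-way mirrors; random-read IOPS scale with vdev count:
  zpool create tank mirror da0 da1 mirror da2 da3
  # grow the first mirror to 3-way by attaching next to an existing member:
  zpool attach tank da0 da4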
Re: ZFS regimen: scrub, scrub, scrub and scrub again.
On Mon, Jan 21, 2013 at 11:36 PM, Peter Jeremy pe...@rulingia.com wrote:
> On 2013-Jan-21 12:12:45 +0100, Wojciech Puchar
> woj...@wojtek.tensor.gdynia.pl wrote:
>> While RAID-Z is already a king of bad performance,
>
> I don't believe RAID-Z is any worse than RAID5. Do you have any actual
> measurements to back up your claim? Leaving aside anecdotal evidence (or
> actual measurements), RAID-Z is fundamentally slower than RAID4/5 *for
> random reads*. This is because RAID-Z spreads each block out over all
> disks, whereas RAID5 (as it is typically configured) puts each block on
> only one disk. So to read a block from RAID-Z, all data disks must be
> involved, vs. for RAID5 only one disk needs to have its head moved. For
> other workloads (especially streaming reads/writes), there is no
> fundamental difference, though of course implementation quality may vary.
>
>> Even better - use UFS.

To each their own. As a ZFS developer, it should come as no surprise that in my opinion and experience, the benefits of ZFS almost always outweigh this downside.

--matt
Re: ZFS regimen: scrub, scrub, scrub and scrub again.
> Please don't misinterpret this post: ZFS's ability to recover from
> fairly catastrophic failures is pretty stellar, but I'm wondering if
> there can be

From my testing it is exactly the opposite. You have to see the difference between marketing and reality.

> a little room for improvement. I use RAID pretty much everywhere. I
> don't like to loose data and disks are cheap. I have a fair amount of
> experience with all flavors ... and ZFS

Just like me. And because I want performance and, as you described, disks are cheap, I use RAID-1 (gmirror).

> has become a go-to filesystem for most of my applications.

My applications don't tolerate low performance, overcomplexity and a high risk of data loss. That's why I use properly tuned UFS and gmirror, and prefer not to use gstripe but to have multiple filesystems.

> One of the best recommendations I can give for ZFS is it's
> crash-recoverability.

Which is marketing, not truth. If you want bullet-proof recoverability, UFS beats everything I've ever seen. If you want FAST crash recovery, use softupdates+journal, available in FreeBSD 9.

> As a counter example, if you have most hardware RAID going or a software
> whole-disk raid, after a crash it will generally declare one disk as
> good and the other disk as to be repaired ... after which a full surface
> scan of the affected disks --- reading one and writing the other ---
> ensues.

True. gmirror does this, but you can defer the mirror rebuild, which I use. I have a script that sends me mail when gmirror is degraded, and - after finding out the cause of the problem, and possibly replacing the disk - I run the rebuild after work hours, so no slowdown is experienced.

> ZFS is smart on this point: it will recover on reboot with a minimum
> amount of fuss. Even if you dislodge a drive ... so that it's missing
> the last 'n' transactions, ZFS seems to figure this out (which I thought
> was extra cudos).

Yes, this is marketing. Practice is somewhat different, as you discovered yourself.

> MY PROBLEM comes from problems that scrub can fix. Let's talk, in
> specific, about my home array. It has 9x 1.5T and 8x 2T in a RAID-Z
> configuration (2 sets, obviously).

While RAID-Z is already a king of bad performance, I assume you mean two POOLS, not 2 RAID-Z sets. If you mixed 2 different RAID-Z pools you would spread the load unevenly and make performance even worse.

> A full scrub of my drives weighs in at 36 hours or so.

Which is funny, as ZFS is marketed as doing this efficiently (like checking only used space).

  dd if=/dev/disk of=/dev/null bs=2m

would take no more than a few hours, and you can do all disks in parallel.

> vr2/cvs:0x1c1
>
> Now ... this is just an example: after each scrub, the hex number was
> different.

Seems like scrub simply does not do its work right.

> ... the new error was found before the old error was cleared. Then this
> new error gets similarly cleared by the next scrub. It seems that if the
> scrub returned to this new found error after fixing the known errors,
> this could save whole new scrub runs from being required.

Even better - use UFS, for both bullet-proof recoverability and performance. If you need help in tuning you may ask me privately.
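The deferred-rebuild workflow described above maps onto gmirror roughly like this; a sketch, with the mirror name and replacement device hypothetical:

  gmirror configure -n home1     # disable autosync while diagnosing
  # ... after work hours, with the bad disk swapped out:
  gmirror forget home1           # drop the dead component
  gmirror insert home1 ada2      # add the new disk; resync starts now
  gmirror configure -a home1     # re-enable autosynchronization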
Re: ZFS regimen: scrub, scrub, scrub and scrub again.
On 2013-Jan-21 12:12:45 +0100, Wojciech Puchar
woj...@wojtek.tensor.gdynia.pl wrote:
> That's why i use properly tuned UFS, gmirror, and prefer not to use
> gstripe but have multiple filesystems

When I started using ZFS, I didn't fully trust it, so I had a gmirrored UFS root (including a full src tree). Over time, I found that gmirror plus UFS was giving me more problems than ZFS. In particular, I was seeing behaviour that suggested the mirrors were out of sync, even though gmirror insisted they were in sync. Unfortunately, there is no way to get gmirror to verify the mirroring, or to get UFS to check the correctness of data or metadata (fsck can only check metadata consistency). I've since moved to a ZFS root.

> Which is marketing, not truth. If you want bullet-proof recoverability,
> UFS beats everything i've ever seen.

I've seen the opposite. One big difference is that ZFS is designed to ensure it returns the data that was written to it, whereas UFS just returns the bytes it finds where it thinks it wrote your data. One side effect of this is that ZFS is far fussier about hardware quality: since it checksums everything, it is likely to pick up glitches that UFS doesn't notice.

> If you want FAST crash recovery, use softupdates+journal, available in
> FreeBSD 9.

I'll admit that I haven't used SU+J, but one downside of SU+J is that it prevents the use of snapshots, which in turn prevents the (safe) use of dump(8) (which is the official tool for UFS backups) on live filesystems.

>> of fuss. Even if you dislodge a drive ... so that it's missing the last
>> 'n' transactions, ZFS seems to figure this out (which I thought was
>> extra cudos).
> Yes this is marketing. practice is somehow different. as you discovered
> yourself.

Most of the time this works as designed. It's possible there are bugs in the implementation.

>> While RAID-Z is already a king of bad performance,

I don't believe RAID-Z is any worse than RAID5. Do you have any actual measurements to back up your claim?

> i assume you mean two POOLS, not 2 RAID-Z sets. if you mixed 2 different
> RAID-Z pools you would spread load unevenly and make performance even
> worse.

There's no real reason why you couldn't have 2 different vdevs in the same pool.

>> A full scrub of my drives weighs in at 36 hours or so.
> which is funny as ZFS is marketed as doing this efficient (like checking
> only used space).

It _does_ only check used space, but it does so in logical order rather than physical order. For a fragmented pool, this means random accesses.

> Even better - use UFS.

Then you'll never know that your data has been corrupted.

> For both bullet proof recoverability and performance.

Use ZFS.

-- Peter Jeremy
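The dump(8)/snapshot dependency mentioned above shows up directly in dump's flags; a sketch, with the dump level, output path and filesystem illustrative:

  # -L makes dump work from a UFS snapshot so the live filesystem stays
  # consistent; on an SU+J filesystem in 9.x the snapshot is refused:
  dump -0auL -f /backup/home.dump /home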
ZFS regimen: scrub, scrub, scrub and scrub again.
Please don't misinterpret this post: ZFS's ability to recover from fairly catastrophic failures is pretty stellar, but I'm wondering if there can be a little room for improvement.

I use RAID pretty much everywhere. I don't like to lose data and disks are cheap. I have a fair amount of experience with all flavors ... and ZFS has become a go-to filesystem for most of my applications. One of the best recommendations I can give for ZFS is its crash-recoverability. As a counter-example, if you have most hardware RAID going, or a software whole-disk RAID, after a crash it will generally declare one disk as good and the other disk as to-be-repaired ... after which a full surface scan of the affected disks --- reading one and writing the other --- ensues. On my Windows desktop, the pair of 2T's takes 3 or 4 hours to do this. A pair of green 2T's can take over 6. You don't lose any data, but you have severely reduced performance until it's repaired. The rub is that you know only one or two blocks could possibly even be different ... and that this is a highly unoptimized way of going about the problem.

ZFS is smart on this point: it will recover on reboot with a minimum amount of fuss. Even if you dislodge a drive ... so that it's missing the last 'n' transactions, ZFS seems to figure this out (which I thought deserved extra kudos).

MY PROBLEM comes from problems that scrub can fix. Let's talk, in specific, about my home array. It has 9x 1.5T and 8x 2T in a RAID-Z configuration (2 sets, obviously). The drives themselves are housed (4 each) in external drive bays with a single SATA connection for each. I think I have spoken of this here before. A full scrub of my drives weighs in at 36 hours or so.

Now around Christmas, while moving some things, I managed to pull the plug on one cabinet of 4 drives. It was likely that the only active use of the filesystem was an automated cvs checkin (backup), given that the errors only appeared on the cvs directory. IN THE END, no data was lost, but I had to scrub 4 times to remove the complaints, which showed up like this in zpool status -v:

  errors: Permanent errors have been detected in the following files:

          vr2/cvs:0x1c1

Now ... this is just an example: after each scrub, the hex number was different. I also couldn't actually find the error on the cvs filesystem, as a side note. Not many files are stored there, and they all seemed to be present.

MY TAKEAWAY from this is that 2 major improvements could be made to ZFS:

1) a pause for scrub... such that long scrubs could be paused during working hours.

2) going back over errors... during each scrub, the new error was found before the old error was cleared. Then this new error gets similarly cleared by the next scrub. It seems that if the scrub returned to this newfound error after fixing the known errors, this could save whole new scrub runs from being required.
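As an aside on item 1: a pausable scrub did eventually appear upstream; the interface sketched below is from later OpenZFS releases and was not available on 2013-era FreeBSD:

  zpool scrub -p vr2    # pause the running scrub, keeping its progress
  zpool scrub vr2       # resume later from where it was paused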
Re: ZFS regimen: scrub, scrub, scrub and scrub again.
Hi,

On 01/20/13 23:26, Zaphod Beeblebrox wrote:
> 1) a pause for scrub... such that long scrubs could be paused during
> working hours.

While it's not exactly a pause, doesn't playing with scrub_delay work here?

  vfs.zfs.scrub_delay: Number of ticks to delay scrub

Set this to a high value during working hours, and set it back to its normal value (or even below it) outside working hours. (Maybe resilver_delay or some other values should also be set; I haven't yet read the relevant code.)
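One way to automate that; a sketch using cron, with the values illustrative (4 ticks is the FreeBSD default, and a large value throttles the scrub heavily):

  # /etc/crontab additions: throttle scrubs during working hours
  0 8  * * 1-5  root  sysctl vfs.zfs.scrub_delay=200
  0 18 * * 1-5  root  sysctl vfs.zfs.scrub_delay=4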