Re: [zfs-discuss] ZFS receive checksum mismatch
On Jun 10, 2011, at 8:59 AM, David Magda wrote:
> On Fri, June 10, 2011 07:47, Edward Ned Harvey wrote:
>
>> #1 A single bit error causes checksum mismatch and then the whole data
>> stream is not receivable.
>
> I wonder if it would be worth adding a (toggleable?) forward error
> correction (FEC) [1] scheme to the 'zfs send' stream.

pipes are your friend!
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS receive checksum mismatch
> I stored a snapshot stream to a file

The tragic irony here is that the file was stored on a non-ZFS
filesystem. You had undetected bitrot which corrupted the stream; other
files may have been silently corrupted as well. You may have just made
one of the strongest cases yet for ZFS and its assurances.

--
This message posted from opensolaris.org
Re: [zfs-discuss] ZFS receive checksum mismatch
On Fri, Jun 10, 2011 at 8:59 AM, Jim Klimov wrote:
> Is such "tape" storage only intended for reliable media such as
> another ZFS or triple-redundancy tape archive with fancy robotics?
> How would it cope with BER in transfers to/from such media?

Large and small businesses have been using TAPE as a BACKUP medium for
decades. One of the cardinal rules is that you MUST have at least TWO
FULL copies if you expect to be able to use them. An incremental backup
is marginally better than an incremental zfs send in that you _can_
recover the files contained in the backup image.

I understand why a zfs send stream is what it is (and you can't pull
individual files out of it), that it must be bit-for-bit correct, and
that if it is large, the chances of a bit error are higher. But given
all that, I still have not heard a good reason NOT to keep zfs send
stream images around as insurance. Yes, they must not be corrupt (that
is true for ANY backup storage), and if they do get corrupted you cannot
(without tweaks that may jeopardize data integrity) "restore" that
stream image. But this really is not a higher bar than for any other
"backup" system.

This is why I wondered at the original poster's comment that he had made
a critical mistake (unless the mistake was using storage for the image
that had a high chance of corruption, and not keeping a second copy of
the image).

Sorry if this has been discussed here before; how much of this list I
get to read depends on how busy I am. Right now I am very busy moving
20 TB of data from a configuration of 14 zpools to a configuration of
one zpool (and only one dataset, no zfs send / recv for me), so I have
lots of time to wait, and I spend some of that time reading this list :-)

P.S. This data is "backed up", both the old and new configurations, via
regular zfs snapshots (for day-to-day needs) and zfs send / recv
replication to a remote site (for DR needs). The initial full zfs send
occurred when the new zpool was new and empty, so I only have to push
the incrementals through the WAN link.

--
Paul Kraus
-> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
-> Sound Coordinator, Schenectady Light Opera Company
   ( http://www.sloctheater.org/ )
-> Technical Advisor, RPI Players
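Paul's "they must not be corrupt" caveat can at least be made checkable:
record a strong checksum next to the stream image and verify it before
any restore attempt. A minimal sketch (GNU coreutils `sha256sum`; the
`dd` line is only a stand-in for a real `zfs send -R pool/fs@snap >
stream.zfs`, so the pipeline can be tried without a pool):

```shell
# Hedged sketch: keep a stream image alongside a checksum so corruption
# is detectable before a restore is attempted. The dd line stands in for
# an actual "zfs send -R pool/fs@snap > stream.zfs".
dd if=/dev/urandom of=stream.zfs bs=1k count=64 2>/dev/null
sha256sum stream.zfs > stream.zfs.sha256   # record checksum with the image
sha256sum -c stream.zfs.sha256             # verify before 'zfs receive'
```

On Solaris-derived systems without GNU coreutils, `digest -a sha256`
plays the same role. This only detects corruption, of course - the
second full copy is still what saves you.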
Re: [zfs-discuss] ZFS receive checksum mismatch
2011-06-10 20:58, Marty Scholes wrote:
>> If it is true that unlike ZFS itself, the replication stream format
>> has no redundancy (even of ECC/CRC sort), how can it be used for
>> long-term retention "on tape"?
>
> It can't. I don't think it has been documented anywhere, but I believe
> that it has been well understood that if you don't trust your storage
> (tape, disk, floppies, punched cards, whatever), then you shouldn't
> trust your incremental streams on that storage.

Well, the whole point of this redundancy in ZFS is about not trusting
any storage (maybe including RAM at some time - but so far it is
requested to be ECC RAM) ;) Hell, we don't ultimately trust any
storage... Oops, I forgot what I wanted to say next ;)

> It's as if the ZFS design assumed that all incremental streams would
> be either perfect or retryable.

Yup. Seems like another ivory-tower assumption ;)

> This is a huge problem for tape retention, not so much for disk
> retention.

Why is that? Because you can make mirrors or raidz of disks?

> On a personal level I have handled this with a separate pool of fewer,
> larger and slower drives which serves solely as backup, taking
> incremental streams from the main pool every 20 minutes or so.
> Unfortunately that approach breaks the legacy backup strategy of
> pretty much every company.

I'm afraid it also breaks backups of petabyte-sized arrays, where it is
impractical to double or triple the number of racks with spinning
drives, but practical to have a closet full of tapes for an automated
robot to feed ;)

> I think the message is that unless you can ensure the integrity of the
> stream, either backups should go to another pool or zfs send/receive
> should not be a critical part of the backup strategy.

Or zfs streams could be improved to VALIDLY become part of such a
strategy.

As for working around the lack of checksums today: I guess we can send
the ZFS stream to a file, compress that file with ZIP, RAR or some other
format with a CRC and some added "recoverability" (i.e. WinRAR claims to
be able to repair about 1% of erroneous file data with standard
settings), and send these ZIP/RAR archives to the tape. Obviously, a
standard integrated solution within ZFS would be better and more
portable. See the FEC suggestion from another poster ;)

//Jim
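The recovery-record idea can be sketched in miniature: keep one XOR
parity block per group of equal-sized data blocks, and any single lost
or corrupt block in the group can be rebuilt. Purely illustrative (the
names are made up); real tools such as par2 or RAR recovery records use
much stronger Reed-Solomon codes:

```python
# Toy sketch of a "recovery record": one XOR parity block per group of
# equal-sized data blocks lets you rebuild any single missing block.
# Illustrative only -- par2/RAR use far stronger Reed-Solomon codes.
from functools import reduce

def parity(blocks):
    # XOR all blocks together, byte by byte
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

def rebuild(blocks, par, missing):
    # XOR of all surviving blocks plus the parity yields the missing one
    others = [b for i, b in enumerate(blocks) if i != missing]
    return parity(others + [par])

data = [b"zfs send", b"streams ", b"on tape."]
p = parity(data)
assert rebuild(data, p, 1) == b"streams "   # block 1 recovered from parity
```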
Re: [zfs-discuss] ZFS receive checksum mismatch
> If it is true that unlike ZFS itself, the replication stream format
> has no redundancy (even of ECC/CRC sort), how can it be used for
> long-term retention "on tape"?

It can't.

I don't think it has been documented anywhere, but I believe that it
has been well understood that if you don't trust your storage (tape,
disk, floppies, punched cards, whatever), then you shouldn't trust your
incremental streams on that storage. It's as if the ZFS design assumed
that all incremental streams would be either perfect or retryable.

This is a huge problem for tape retention, not so much for disk
retention. On a personal level I have handled this with a separate pool
of fewer, larger and slower drives which serves solely as backup, taking
incremental streams from the main pool every 20 minutes or so.
Unfortunately that approach breaks the legacy backup strategy of pretty
much every company.

I think the message is that unless you can ensure the integrity of the
stream, either backups should go to another pool or zfs send/receive
should not be a critical part of the backup strategy.
Re: [zfs-discuss] zpool import hangs any zfs-related programs, eats all RAM and dies in swapping hell
2011-06-10 18:00, Steve Gonczi wrote:
> Hi Jim,
>
> I wonder what OS version you are running?
>
> There was a problem similar to what you are describing in earlier
> versions in the 13x kernel series. Should not be present in the 14x
> kernels.

It is OpenIndiana oi_148a, and unlike many other details -
this one was in my email post today ;)

> I missed the system parameters in your earlier emails.

Other config info: one dual-core P4 @ 2.8GHz or so, 8Gb RAM (the
maximum for the motherboard), 6*2Tb Seagate ST2000DL003 disks in raidz2,
plus an old 80Gb disk for the OS and swap.

This system turned overnight from a test box, to a backup of some of my
old-but-needed files, to their only storage after the original server
was cleaned. So I do want this box to work reliably and not lose the
data which is on it already, and without dedup at 1.5x I'm running close
to not fitting in these 8Tb ;)

> The usual recommended solution is "oh just do not use dedup, it is not
> production ready".

Well, alas, I am coming to the same conclusion so far... I have never
seen any remotely similar issues on any other servers I maintain, but
this one is the first guinea-pig box for dedup. I guess for the next few
years (until 128Gb RAM or so becomes the norm for a cheap home
enthusiast NAS) it may be the last one, too ;)

> Using dedup is more or less hopeless with less than 16 gigs of memory.
> (More is better.)

Well... 8Gb RAM is not as low-end a configuration as some others
discussed in home-NAS-with-ZFS blogs which claimed to have used dedup
when it first came out. :) Since there's little real information about
DDT appetites so far (this list is still buzzing with calculations and
tests), I assumed that 8Gb RAM is a reasonable amount for starters.

At least it would have been for, say, Linux or other open-source
enthusiast communities, which have to make do with whatever crappy
hardware they got second-hand ;) I was in that camp, and I know that
personal budgets do not often assume more than $1-2k per box. Until
recently that included very little RAM. And as I said, this is the
maximum which can be put into my motherboard anyway.

Anyhow, this box has two pools at the moment:
* "pool" is the physical raidz2 on 6 disks with ashift=12.
  Some datasets on pool are deduped and compressed (lzjb).
* "dcpool" is built in a compressed volume inside "pool", which is
  loopback-mounted over iSCSI; in the resulting disk I made another
  pool with deduped datasets.

This "dcpool" was used to test the idea of separating compression and
deduplication (so that dedup decisions are made about raw source data,
and after that whatever has to be written is compressed - once). The POC
worked somewhat - except that I used small block sizes in the
"pool/dcpool" volume, so the ZFS metadata to address the volume blocks
takes up as much space as the userdata.

Performance was abysmal - around 1Mb/sec to write into "dcpool" lately,
and not much faster to read it, so for the past month I have been trying
to evacuate my data back from "dcpool" into "pool", which performs
faster - about 5-10Mb/s during these copies between pools. According to
iostat, the physical harddisks are quite busy (over 60%) while "dcpool"
is often stuck at 100% busy with several seconds(!) of wait times and
zero IO operations. The physical "pool" datasets without dedup performed
at "wirespeed" 40-50Mb/s when I was copying files over CIFS from another
computer over a home gigabit LAN.

Since this box is primarily archive storage, with maybe a few datasets
dedicated to somewhat active data (updates to a photo archive), slow but
infrequent IO to deduped data is okay with me -- as long as the system
doesn't crash as it does now. Partially, this is why I bumped up the TXG
sync interval to 30 seconds - so that ZFS would have more data in
buffers after slow IO, to coalesce and try to minimize fragmentation and
mechanical IOPS involved.

> The issue was an incorrectly sized buffer that caused ZFS to wait too
> long for a buffer allocation. I can dig up the bug number and the fix
> description if you are running something 130-ish.

Unless this fix was not integrated into OI for some reason, I am afraid
digging it up would be of limited help. Still, I would be interested to
read the summary, postmortem and workarounds. Maybe this was broken
again in OI by newer "improvements"? ;)

> The thing I would want to check is the sync times and frequencies.
> You can dtrace (and timestamp) this.

Umm... could you please suggest a script, preferably one that I can
leave running on the console, printing stats every second or so? I can
only think of "time sync" so far, and I also think it could only be
counted in increments of a few seconds or more.

> I would suspect when the bad state occurs, your sync is taking a
> _very_ long time.

When the VERY BAD state occurs, I can no longer use the system or test
anything ;) When it nearly occurs, I have only a few seconds of uptime
left, and since each boot-to-crash run takes roughly 2-3 hours now, I
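For the sync-timing script requested above, a small DTrace sketch along
these lines should work on OpenIndiana (a hedged, untested suggestion:
`spa_sync()` is the kernel entry point for a txg sync in illumos-era
ZFS, but probe availability and struct member names can differ between
builds):

```
#!/usr/sbin/dtrace -qs
/* Hedged sketch: print how long each ZFS txg sync takes, per pool. */
fbt::spa_sync:entry
{
        self->ts = timestamp;
        self->spa = (spa_t *)arg0;
}

fbt::spa_sync:return
/self->ts/
{
        printf("%Y %s: txg sync took %d ms\n", walltimestamp,
            stringof(self->spa->spa_name),
            (timestamp - self->ts) / 1000000);
        self->ts = 0;
        self->spa = 0;
}
```

Left running on a console, this would show the sync times stretching as
the bad state approaches.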
Re: [zfs-discuss] ZFS receive checksum mismatch
On Fri, June 10, 2011 07:47, Edward Ned Harvey wrote:

> #1 A single bit error causes checksum mismatch and then the whole data
> stream is not receivable.

I wonder if it would be worth adding a (toggleable?) forward error
correction (FEC) [1] scheme to the 'zfs send' stream.

Even if we're talking about a straight zfs send/recv pipe, and not
saving to a file, it'd be handy as you wouldn't have to restart a large
transfer for a single bit error (especially for those long initial syncs
of remote 'mirrors').

[1] http://en.wikipedia.org/wiki/Forward_error_correction
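As a toy model of what FEC buys you (illustrative only - a real
implementation would use something far more efficient, e.g.
Reed-Solomon): a repetition code that sends every byte three times lets
the receiver outvote a single flipped bit instead of rejecting the whole
stream:

```python
# Toy forward error correction: a 3x repetition code decoded by majority
# vote. Real FEC is vastly more space-efficient; this only illustrates
# how a wrapped stream survives a single bit flip in transit.
def encode(data: bytes) -> bytes:
    return bytes(b for byte in data for b in (byte, byte, byte))

def decode(coded: bytes) -> bytes:
    out = bytearray()
    for i in range(0, len(coded), 3):
        a, b, c = coded[i:i + 3]
        out.append(a if a == b or a == c else b)   # majority of the three
    return bytes(out)

stream = b"zfs send payload"
coded = bytearray(encode(stream))
coded[5] ^= 0x10                      # a single bit error on the wire
assert decode(bytes(coded)) == stream  # receiver still recovers the stream
```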
[zfs-discuss] ACARD ANS-9010 does not work well with the LSI 9211-8i SAS interface
Just want to share this with you: we have found, and have been suffering
from, some weird issues because of this combination. It is better to
connect these devices directly to the SATA ports on the mainboard.

Thanks.

Fred
Re: [zfs-discuss] zpool import hangs any zfs-related programs, eats all RAM and dies in swapping hell
2011-06-10 13:51, Jim Klimov wrote:
> and the system dies in swapping hell (scanrates for available pages
> were seen to go into millions, CPU context switches reach 200-300k/sec
> on a single dualcore P4) after eating the last stable-free 1-2Gb of
> RAM within a minute. After this the system responds to nothing except
> the reset button.

I've captured an illustration of this today, with my watchdog as well as
vmstat, top and other tools. Half a gigabyte in under one second - the
watchdog never saw it coming :(

My "freeram-watchdog" is based on vmstat but emphasizes deltas in
"freeswap" and "freeram" values (see the middle columns) and has fewer
fields with more readable names ;)

freq freeswap freeram scanrate Dswap    Dram    in    sy   cs us sy id
  1  6652236  497088        0     0    -5380  3428  3645 3645  0 83 17
  1  6652236  502112        0     0     5024  2332  2962 2962  1 72 27
  1  6652236  494656        0     0    -7456  2886  3641 3641  0 78 21
  1  6652236  502024        0     0     7368  3748  4197 4197  1 83 16
  1  6652236  502316        0     0      292  4090  2516 2516  0 68 32
  1  6652236  498388        0     0    -3928  2270  3940 3940  1 76 24
  1  6652236  502264        0     0     3876  3097  3097 3097  0 76 23
  1  6652236  495052        0     0    -7212  2705  2796 2796  1 86 14
  1  6652236  502384        0     0     7332  3609  4449 4449  1 81 18
  1  6652236  502292        0     0      -92  3639  2639 2639  1 80 19
  1  6652236   92064  3435680     0  -410228 15665  1312 1312  0 99  0

In vmstat it looked like this:

 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po  fr  de  sr f0 s2 s3 s4   in   sy     cs us sy id
 0 0 0 6652236 495052  0  32  0  0   0   0   0  0  0  0  0 3037 4107 158598  0 90 10
 0 0 0 6652236 502384  0  24  0  0   0   0   0  0  0  0  0 3266 2697 114195  1 78 21
 6 0 0 6652236 502292  0  23  0  0   0   0   0  0 15 15 15 2947 3048 130070  0 87 13
29 0 0 6652236  92064 124 155  0  0 5084  0 3706374 0 0 0 0 16743 1244 2696  0 100 0

So, for a couple of minutes before the freeze, the system was rather
stable at around 500Mb free RAM. Before that couple of minutes it was
stable at 1-1.2Gb, but then jumped down to 500Mb in about 10sec. And in
the last second of known uptime, the system ate up at least 400Mb and
began scanning for free pages at 3.7 million scans/sec.

Usually reaching this condition takes about 3-5 seconds, I see "cs"
going up to about 200k, and my watchdog has time to reboot the system.
Not this time :(

According to top, free RAM dropped down to 32Mb (which in my older
adventures was also the empirical lower limit of RAM at which the system
began scanrating to death), with the zpool process ranking high - but I
haven't yet seen pageout make it to the top of the list in the system's
last second of life:

last pid:  1786;  load avg:  3.59, 2.10, 1.09;  up 0+02:20:28  15:07:20
118 processes: 100 sleeping, 16 running, 2 on cpu
CPU states:  0.0% idle,  0.4% user, 99.6% kernel,  0.0% iowait,  0.0% swap
Kernel: 2807 ctxsw, 210 trap, 16730 intr, 1388 syscall, 161 flt
Memory: 8191M phys mem, 32M free mem, 6655M total swap, 6655M free swap

   PID USERNAME NLWP PRI NICE  SIZE   RES STATE   TIME    CPU COMMAND
  1464 root      138  99  -20    0K    0K sleep   2:01 30.93% zpool-dcpool
     2 root        2  97  -20    0K    0K cpu/1   0:03 26.67% pageout
  1220 root        1  59    0 4400K 2188K sleep   1:04  0.57% prstat
  1477 root        1  59    0 2588K 1756K run     3:13  0.28% freeram-watchdo
     3 root        1  60  -20    0K    0K sleep   0:17  0.21% fsflush
   522 root        1  59    0 4172K 1000K run     0:35  0.20% top

One way or another, such repeatable failure behaviour is not at all
acceptable for a production storage platform :( and I hope to see it
fixed - if I can help somehow?..

I *think* one way to reproduce it would be:
1) Enable dedup (optional?)
2) Write lots of data to disk, i.e. 2-3Tb
3) Delete lots of data, or make and destroy a snapshot, or destroy a
   dataset with test data

This puts the system into a position with lots of processing of
(not-yet-)deferred deletes. In my case this by itself often leads to RAM
starvation, hangs and a following reset of the box; you can reset the
TEST system during such delete processing. Now when you reboot and try
to import this test pool, you should have a situation like mine - the
pool does not import quickly, zfs-related commands hang, and in a few
hours the box should die ;)

iostat reports many small reads and occasional writes (starting after
about 10 minutes into the import), which gives me hope that the pool
will come back online sometime...

The current version of my software watchdog, which saves some trouble
for my assistant by catching near-freeze conditions, is here:
* http://thumper.cos.ru/~jim/freeram-watchdog-20110610-v0.11.tgz

--
Jim Klimov (Климов Евгений), CTO, JSC COS&HT
+7-903-7705859 (cellular) mailto:jimkli...@cos.ru
CC: ad...@cos.ru, jimkli...@mail.ru
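The watchdog approach can be sketched roughly like this (a hypothetical
simplification of the linked freeram-watchdog, not its actual code; the
column indices assume Solaris `vmstat` output, and the `uadmin` call
mirrors the uadmin(2) trick described above):

```python
# Hedged sketch of a freeram watchdog: sample `vmstat 1` and force a
# reboot when free RAM or the page scan rate crosses a threshold.
# Field positions follow Solaris vmstat ("free" is field 4, "sr" is
# field 11, 0-based); adjust for your platform. Thresholds are made up.
import subprocess

FREE_MIN_KB = 131072     # 128Mb free: below this, assume imminent freeze
SCANRATE_MAX = 100000    # pages scanned per second

def is_critical(line: str) -> bool:
    f = line.split()
    if len(f) < 22 or not f[0].lstrip('-').isdigit():
        return False                       # skip vmstat header lines
    free_kb, sr = int(f[4]), int(f[11])
    return free_kb < FREE_MIN_KB or sr > SCANRATE_MAX

def watch():
    proc = subprocess.Popen(["vmstat", "1"],
                            stdout=subprocess.PIPE, text=True)
    for line in proc.stdout:
        if is_critical(line):
            # ungraceful software reboot, as with uadmin(2) in the post
            subprocess.run(["uadmin", "2", "1"])
            break
```

As the post notes, 1-second strobes can still be too coarse: the last
capture above went from ~500Mb free to dead inside one sample.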
Re: [zfs-discuss] ZFS receive checksum mismatch
2011-06-10 15:58, Darren J Moffat wrote:
> As I pointed out last time this came up, the NDMP service on Solaris 11
> Express and on the Oracle ZFS Storage Appliance uses the 'zfs send'
> stream as what is to be stored on the "tape".

This discussion turns interesting ;)

Just curious: how do these products work around the stream fragility we
are discussing here - that a single-bit error can/will/should make the
whole zfs send stream invalid, even though it is probably an error
localized in a single block? That block is ultimately related to a file
(or a few files, in case of dedup or snapshots/clones) whose name
"zfs recv" could report for an admin to take action such as rsync.

If it is true that unlike ZFS itself, the replication stream format has
no redundancy (even of ECC/CRC sort), how can it be used for long-term
retention "on tape"?

I understand about online transfers, somewhat. If the transfer failed,
you still have the original to retry. But backups are often needed when
the original is no longer alive - that's why they are needed ;) And by
Murphy's law, that's when this single bit strikes ;)

Is such "tape" storage only intended for reliable media, such as another
ZFS or a triple-redundancy tape archive with fancy robotics? How would
it cope with the BER of transfers to/from such media?

Also, an argument was recently posed (when I wrote of saving zfs send
streams into files and transferring them by rsync over slow bad links)
that for most online transfers I should rather use zfs send of
incremental snapshots. While I agree in the sense that an incremental
transfer is presumably smaller, and so has less chance of corruption
(network failure) during transfer than a huge initial stream, this
chance of corruption is still non-zero. It is simply that with online
transfers I can detect the error and retry at low cost (or big cost -
bandwidth is not free in many parts of the world).

Going back to storing many streams (initial + increments) on tape: if an
intermediate incremental stream has a single-bit error, then its
snapshot and any that follow it cannot be received into zfs. This holds
even if the "broken" block is later freed and discarded (equivalent to
overwriting with a newer version of a file from a newer increment in
classic backup systems, where a file is the unit of backup). And since
the total size of initial+incremental backups is likely larger than a
single full dump, the chance of a single corruption making your (latest)
backup useless would also be higher, right?

Thanks for clarifications,
//Jim Klimov
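The closing size argument can be made concrete with toy numbers (a
hypothetical uniform bit error rate is assumed purely for illustration;
real media errors are burstier):

```python
# Back-of-the-envelope: if corruption of any single stream in an
# initial+incremental chain ruins the restore, a chain whose total size
# exceeds one full dump is the riskier backup. Assumes independent bit
# errors at an illustrative rate `ber` per bit.
def p_intact(size_bytes, ber=1e-15):
    return (1 - ber) ** (size_bytes * 8)

full = p_intact(2e12)                        # one 2 Tb full dump
chain = p_intact(2e12) * p_intact(0.5e12)    # initial 2 Tb + 0.5 Tb increments
assert chain < full                          # bigger total => likelier ruin
```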
Re: [zfs-discuss] ZFS receive checksum mismatch
On 06/10/11 12:47, Edward Ned Harvey wrote:
>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
>> boun...@opensolaris.org] On Behalf Of Jonathan Walker
>>
>> New to ZFS, I made a critical error when migrating data and
>> configuring zpools according to needs - I stored a snapshot stream to
>> a file using "zfs send -R [filesystem]@[snapshot] > [stream_file]".
>
> There are precisely two reasons why it's not recommended to store a
> zfs send datastream for later use. As long as you can acknowledge and
> accept these limitations, then sure, go right ahead and store it. ;-)
> A lot of people do, and it's good.

Not recommended by whom? Which documentation says this?

As I pointed out last time this came up, the NDMP service on Solaris 11
Express and on the Oracle ZFS Storage Appliance uses the 'zfs send'
stream as what is to be stored on the "tape".

--
Darren J Moffat
Re: [zfs-discuss] ZFS receive checksum mismatch
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Jonathan Walker
>
> New to ZFS, I made a critical error when migrating data and
> configuring zpools according to needs - I stored a snapshot stream to
> a file using "zfs send -R [filesystem]@[snapshot] > [stream_file]".

There are precisely two reasons why it's not recommended to store a zfs
send datastream for later use. As long as you can acknowledge and accept
these limitations, then sure, go right ahead and store it. ;-) A lot of
people do, and it's good.

#1 A single bit error causes a checksum mismatch, and then the whole
data stream is not receivable. Obviously you encountered this problem
already, and you were able to work around it. If I were you, however, I
would be skeptical about data integrity on your system. You said you
scrubbed and corrected a couple of errors, but that's not actually
possible: the filesystem integrity checksums are for detection, not
correction, of corruption. The only way corruption gets corrected is
when there's a redundant copy of the data... Then ZFS can discard the
corrupt copy, overwrite it with a good copy, and all the checksums
suddenly match. Of course there is no such thing in the zfs send data
stream - no redundant copy in the data stream. So yes, you have
corruption. The best you can possibly do is identify where it is, and
then remove the affected files.

#2 You cannot do a partial receive, nor generate a catalog of the files
within the datastream. You can restore the whole filesystem or nothing.
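A toy illustration of reason #1 - using SHA-256 in place of the stream's
actual internal checksum, purely to show the detect-but-not-correct
property: one flipped bit anywhere makes the whole image fail
verification, and nothing in the stream can repair it.

```python
# One bit flip in a stored stream image: the checksum detects it, but
# with no redundant copy there is nothing to correct it with, so the
# whole image becomes unreceivable. (SHA-256 stands in for the stream's
# real checksum here; the principle is the same.)
import hashlib

stream = bytearray(b"pretend this is a zfs send stream " * 100)
good = hashlib.sha256(stream).hexdigest()   # checksum at send time

stream[1234] ^= 0x01                        # a single bit error on the medium
assert hashlib.sha256(stream).hexdigest() != good   # receive would refuse it
```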
Re: [zfs-discuss] ZFS receive checksum mismatch
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Jim Klimov
>
> Besides, the format is not public and subject to change, I think. So
> future compatibility is not guaranteed.

That is not correct. Years ago, there was a comment in the man page that
said this:

"The format of the stream is evolving. No backwards compatibility is
guaranteed. You may not be able to receive your streams on future
versions of ZFS."

But in the last several years, backward/forward compatibility has always
been preserved, so despite the warning, it was never a problem. In more
recent versions, the man page says:

"The format of the stream is committed. You will be able to receive your
streams on future versions of ZFS."
[zfs-discuss] zpool import hangs any zfs-related programs, eats all RAM and dies in swapping hell
The subject says it all, more or less: due to some problems with a pool
(i.e. deferred deletes a month ago, possibly something similar now),
"zpool import" hangs any zfs-related programs, including "zfs", "zpool",
"bootadm", sometimes "df". After several hours of disk-thrashing, all
8Gb of RAM in the system is consumed (by the kernel, I guess, because
"prstat" and "top" don't show any huge processes) and the system dies in
swapping hell (scanrates for available pages were seen to go into
millions, CPU context switches reach 200-300k/sec on a single dual-core
P4) after eating the last stable-free 1-2Gb of RAM within a minute.
After this the system responds to nothing except the reset button.

ZDB walks were seen to take up over 20Gb of VM, but as ZDB is a userland
process, it could swap. I guess the kernel is doing something similarly
greedy - but it can't swap out kernel memory.

So regarding the hanging ZFS-related programs, I think there's some bad
locking involved (i.e. shouldn't I be able to see or configure other
pools besides the one being imported?), and regarding the VM depletion
without swapping - that seems like a kernel problem.

Part of the problem is that the box is on "remote support": while it is
a home NAS, I am away from home for months - so my neighbor assists by
walking in to push reset. While I was troubleshooting the problem, I
wrote a watchdog program based on vmstat, which catches bad conditions
and calls uadmin(2) to force an ungraceful software reboot. Quite often
it does not have enough time to react, though - 1-second strobes into
kernel VM stats are a very long period :(

The least I can say is that this is very annoying, to the point that I
am not sure what variant of Solaris to build my customers' and friends'
NASes with. This box is currently on OI_148a with the updated ZFS
package from Mar 2011, and while I am away I am not sure I can safely
update it remotely.

Actually, I wrote about this situation in detail on the forums, but that
was before web-posts were forwarded to email, so I never got any
feedback. There's a lot of detailed text in these threads, so I won't go
over all of it again now:
* http://opensolaris.org/jive/thread.jspa?threadID=138604&tstart=0
* http://opensolaris.org/jive/thread.jspa?threadID=138740&tstart=0

Back then it took about a week of reboots for the "pool" to finally get
imported, with no visible progress-tracker except running ZDB to see
that the deferred-free list was decreasing, and wondering if maybe it
was the culprit (in the end, it was). I was also lucky that this ZFS
cleanup of deferred-free blocks was cumulative, and the gained progress
survived across reboots.

Currently I have little idea what the problem is with my "dcpool" (it
lives in a volume in "pool" and mounts over iSCSI) - ZDB has not
finished yet, and two days of reboots every 3 hours have not fixed the
problem; "dcpool" still does not import.

Since my box's OS is OpenIndiana, I filed a few bugs to track these
problems as well, with little activity from other posters:
* https://www.illumos.org/issues/841
* https://www.illumos.org/issues/956

The current version of my software watchdog, which saves some trouble
for my assistant by catching near-freeze conditions, is here:
* http://thumper.cos.ru/~jim/freeram-watchdog-20110610-v0.11.tgz

I guess it is time for questions now :)

What methods can I use (besides 20-hour-long ZDB walks) to gain quick
insight into the cause of the problems - why doesn't the pool import
quickly? Does it make and keep any progress while trying to import over
numerous reboots? How much is left? Are there any tunables I have not
tried yet?

Currently I have the following settings to remedy different performance
and stability problems of this box:

# cat /etc/system | egrep -v '\*|^$'
set zfs:aok = 1
set zfs:zfs_recover = 1
set zfs:zfs_resilver_delay = 0
set zfs:zfs_resilver_min_time_ms = 2
set zfs:zfs_scrub_delay = 0
set zfs:zfs_arc_max=0x1a000
set zfs:arc_meta_limit = 0x18000
set zfs:zfs_arc_meta_limit = 0x18000
set zfs:metaslab_min_alloc_size = 0x8000
set zfs:metaslab_smo_bonus_pct = 0xc8
set zfs:zfs_write_limit_override = 0x1800
set zfs:zfs_txg_timeout = 30
set zfs:zfs_txg_synctime = 30
set zfs:zfs_vdev_max_pending = 5

Are my guesses that "this is a kernel problem" at all correct? Did, by
chance, any related fixes make their way into the development versions
of newer OpenIndianas (148b, 151, pkg-dev repository) already?

Thanks for any comments, condolences, insights, bugfixes ;)
//Jim Klimov
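For the RAM-consumption angle: a rough, hedged estimate of the dedup
table this deduped pool drags around, using the commonly cited ballpark
of ~320 bytes of core per in-core DDT entry (an assumption, not a
measured figure):

```python
# Back-of-the-envelope DDT sizing. ~320 bytes per in-core DDT entry is a
# commonly quoted ballpark for ZFS of this era, not an exact number; the
# real footprint depends on block sizes and how much of the DDT is hot.
def ddt_ram_bytes(pool_bytes, avg_block=128 * 1024, bytes_per_entry=320):
    unique_blocks = pool_bytes // avg_block   # worst case: no duplicates
    return unique_blocks * bytes_per_entry

tb = 1000 ** 4
est = ddt_ram_bytes(8 * tb)   # roughly the usable space of this raidz2
print(round(est / 2 ** 30))   # ~18 GiB -- more than twice the box's 8Gb
```

If numbers like these are anywhere near right, an import that must churn
through deferred-free DDT updates on an 8Gb box would plausibly behave
exactly as described above.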