Re: [zfs-discuss] ZFS receive checksum mismatch

2011-06-10 Thread Richard Elling
On Jun 10, 2011, at 8:59 AM, David Magda wrote:

> On Fri, June 10, 2011 07:47, Edward Ned Harvey wrote:
> 
>> #1  A single bit error causes checksum mismatch and then the whole data
>> stream is not receivable.
> 
> I wonder if it would be worth adding a (toggleable?) forward error
> correction (FEC) [1] scheme to the 'zfs send' stream.

pipes are your friend!
 -- richard
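
(One reading of that hint, as a minimal sketch - the dataset, paths and the
choice of tee(1)/digest(1) are assumptions, not a recommendation from the
poster: pipe the stream through tee so a checksum is stored next to the file
and the copy can at least be verified before a receive is attempted.)

zfs send -R tank/fs@snap | tee /backup/fs.zfs | digest -a sha256 > /backup/fs.zfs.sha256
# before attempting the receive, recompute and compare:
digest -a sha256 /backup/fs.zfs; cat /backup/fs.zfs.sha256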



Re: [zfs-discuss] ZFS receive checksum mismatch

2011-06-10 Thread Marty Scholes
> I stored a snapshot stream to a file

The tragic irony here is that the file was stored on a non-ZFS filesystem.  You 
had undetected bitrot which silently corrupted the stream.  Other files 
might have been corrupted as well.

You may have just made one of the strongest cases yet for zfs and its 
assurances.


Re: [zfs-discuss] zpool import hangs any zfs-related programs, eats all RAM and dies in swapping hell

2011-06-10 Thread Tim Cook
On Jun 10, 2011 11:52 AM, "Jim Klimov"  wrote:
>
> 2011-06-10 18:00, Steve Gonczi wrote:
>>
>> Hi Jim,
>>
>> I wonder what  OS version you are running?
>>
>> There was a problem similar to what you are describing in earlier versions
>> in the 13x kernel series.
>>
>> Should not be present in the 14x kernels.
>
>
> It is OpenIndiana oi_148a, and unlike many other details -
> this one was in my email post today ;)
>
>> I missed the system parameters in your earlier emails.
>
>
> Other config info: one dual-core P4 @2.8Ghz or so, 8Gb RAM
> (max for the motherboard), 6*2Tb Seagate ST2000DL003 disks
> in raidz2 plus an old 80Gb disk for the OS and swap.
>
> This system turned overnight from a test box to a backup
> of some of my old-but-needed files, to their only storage
> after the original server was cleaned. So I do want this
> box to work reliably and not lose the data which is on it
> already, and without dedup at 1.5x I'm running close to
> not fitting in these 8Tb ;)
>
>> The usual  recommended solution is
>> "oh just do not use dedup, it is not production ready"
>
>
> Well, alas, I am getting to the same conclusion so far...
> I have never seen any remotely similar issues on any other
> servers I maintain, but this one is the first guinea-pig
> box for dedup. I guess for the next few years (until 128Gb
> RAM or so would become the norm for a cheap home enthusiast
> NAS) this one may be the last, too ;)
>
>> Using Dedup is more or less  hopeless with less than 16Gigs of memory.
>> (More is better)
>
>
> Well... 8Gb RAM is not as low-end a configuration as some
> others discussed in home-NAS-with-ZFS blogs which claimed
> to have used dedup when it first came out. :)
>
> Since there's little real information about DDT appetites
> so far (this list is still buzzing with calculations and
> tests), I assumed that 8Gb RAM is a reasonable amount for
> starters. At least it would have been for, say, Linux or
> other open-source enthusiast communities, which have to
> make do with whatever crappy hardware they got second-hand ;)
> I was in that camp, and I know that personal budgets do
> not often assume more than 1-2k$ per box. Until recently
> that included very little RAM. And as I said, this is the
> maximum which can be put into my motherboard anyway.
>
> Anyhow, this box has two pools at the moment:
> * "pool" is the physical raidz2 on 6 disks with ashift=12
>   Some datasets on pool are deduped and compressed (lzjb).
> * "dcpool" is built in a compressed volume inside "pool",
>   which is loopback-mounted over iSCSI and in the resulting
>   disk I made another pool with deduped datasets.
>
> This "dcpool" was used to test the idea about separating
> compression and deduplication (so that dedup decisions
> are made about raw source data, and after that whatever
> has to be written is compressed - once).
>
> The POC worked somewhat - except that I used small block
> sizes in the "pool/dcpool" volume, so ZFS metadata to
> address the volume blocks takes up as much as the userdata.
>
> Performance was abysmal - around 1Mb/sec to write into
> "dcpool" lately, and not much faster to read it, so for
> the past month I have been trying to evacuate my data back from
> "dcpool" into "pool", which performs faster - about 5-10Mb/s
> during these copies between pools, and according to iostat,
> the physical harddisks are quite busy (over 60%) while
> "dcpool" is often stuck at 100% busy with several seconds(!)
> of wait times and zero IO operations. The physical "pool"
> datasets without dedup performed at "wirespeed" 40-50Mb/s
> when I was copying files over CIFS from another computer
> over a home gigabit LAN. Since this box is primarily an
> archive storage with maybe a few datasets dedicated to
> somewhat active data (updates to photo archive), slow
> but infrequent IO to deduped data is okay with me --
> as long as the system doesn't crash as it does now.
>
> Partially, this is why I bumped up TXG Sync intervals
> to 30 seconds - so that ZFS would have more data in
> buffers after slow IO to coalesce and try to minimize
> fragmentation and mechanical IOPS involved.
>
>
>> The issue was an incorrectly sized buffer that caused ZFS to be waiting too long
>> for a buffer allocation.  I can dig up the bug number and the fix description
>> if you are running something 130-ish.
>
>
> Unless this fix was not integrated into OI for some reason,
> I am afraid digging it up would be of limited help. Still,
> I would be interested to read the summary, postmortem and
> workarounds.
>
> Maybe this was broken again in OI by newer "improvements"? ;)
>
>> The thing I would want to check is the sync times and frequencies.
>> You can dtrace (and timestamp) this.
>
>
> Umm... could you please suggest a script, preferably one
> that I can leave running on console and printing stats
> every second or so?
>
> I can only think of "time sync" so far. And I also think
> it could be counted in a few seconds or more.
>
>
>> I would suspect when the bad

Re: [zfs-discuss] ZFS receive checksum mismatch

2011-06-10 Thread Paul Kraus
On Fri, Jun 10, 2011 at 8:59 AM, Jim Klimov  wrote:

> Is such "tape" storage only intended for reliable media such as
> another ZFS or triple-redundancy tape archive with fancy robotics?
> How would it cope with BER in transfers to/from such media?

Large and small businesses have been using TAPE as a BACKUP medium
for decades. One of the cardinal rules is that you MUST have at least
TWO FULL copies if you expect to be able to use them. An Incremental
backup is marginally better than an incremental zfs send in that you
_can_ recover the files contained in the backup image.

I understand why a zfs send is what it is (and you can't pull
individual files out of it), and that it must be bit for bit correct,
and that IF it is large, then the chances of a bit error are higher.
But given all that, I still have not heard a good reason NOT to keep
zfs send stream images around as insurance. Yes, they must not be
corrupt (that is true for ANY backup storage), and if they do get
corrupted you cannot (without tweaks that may jeopardize the data
integrity) "restore" that stream image. But this really is not a
higher bar than for any other "backup" system. This is why I wondered
at the original poster's comment that he had made a critical mistake
(unless the mistake was using storage for the image that had a high
chance of corruption and did not have a second copy of the image).

Sorry if that has been discussed here before; how much of this
list I get to read depends on how busy I am. Right now I am very busy
moving 20 TB of data from one configuration of 14 zpools to a
configuration of one zpool (and only one dataset, no zfs send / recv
for me), so I have lots of time to wait, and I spend some of that time
reading this list :-)

P.S. This data is "backed up", both the old and new configuration via
regular zfs snapshots (for day to day needs) and zfs send / recv
replication to a remote site (for DR needs). The initial zfs full send
occurred when the new zpool was new and empty, so I only have to push
the incrementals through the WAN link.

-- 
Paul Kraus
-> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
-> Sound Coordinator, Schenectady Light Opera Company (
http://www.sloctheater.org/ )
-> Technical Advisor, RPI Players


Re: [zfs-discuss] ZFS receive checksum mismatch

2011-06-10 Thread Jim Klimov

2011-06-10 20:58, Marty Scholes wrote:

If it is true that unlike ZFS itself, the replication stream format has
no redundancy (even of ECC/CRC sort), how can it be used for
long-term retention "on tape"?

It can't.  I don't think it has been documented anywhere, but I believe that it 
has been well understood that if you don't trust your storage (tape, disk, 
floppies, punched cards, whatever), then you shouldn't trust your incremental 
streams on that storage.


Well, the whole point of this redundancy in ZFS is about not trusting
any storage (maybe including RAM at some time - but so far it is
requested to be ECC RAM) ;)

Hell, we don't ultimately trust any storage...
Oops, I forgot what I wanted to say next ;)


It's as if the ZFS design assumed that all incremental streams would be either 
perfect or retryable.


Yup. Seems like another ivory-tower assumption ;)


This is a huge problem for tape retention, not so much for disk retention.


Because why? You can make mirrors or raidz of disks?


On a personal level I have handled this with a separate pool of fewer, larger 
and slower drives which serves solely as backup, taking incremental streams 
from the main pool every 20 minutes or so.

Unfortunately that approach breaks the legacy backup strategy of pretty much 
every company.


I'm afraid it also breaks backups of petabyte-sized arrays where
it is impractical to double or triple the number of racks with spinning
drives, but is practical to have a closet full of tapes fed by an
automated robot ;)



I think the message is that unless you can ensure the integrity of the stream, 
either backups should go to another pool or zfs send/receive should not be a 
critical part of the backup strategy.


Or that zfs streams can be improved to VALIDLY become part
of such a strategy.

Regarding the checksums in ZFS, as of now I guess we
can send the ZFS streams to a file, compress this file
with ZIP, RAR or some other format with CRC and some
added "recoverability" (i.e. WinRAR claims to be able
to repair about 1% of erroneous file data with standard
settings) and send these ZIP/RAR archives to the tape.

Obviously, a standard integrated solution within ZFS
would be better and more portable.

See FEC suggestion from another poster ;)
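
A minimal sketch of that wrapping idea (the rar "-rr" recovery-record switch
is standard, but the 5% figure and all names/paths here are assumptions):

cd /var/tmp
zfs send -R pool/data@snap > data.zfs
rar a -rr5% data.zfs.rar data.zfs     # archive with a 5% recovery record; send the .rar to tape
# restore path: "rar t" to verify, "rar r" to repair from the recovery record if needed, then:
unrar x data.zfs.rar && zfs receive -F pool/data < data.zfs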


//Jim




Re: [zfs-discuss] ZFS receive checksum mismatch

2011-06-10 Thread Marty Scholes
> If it is true that unlike ZFS itself, the replication stream format has
> no redundancy (even of ECC/CRC sort), how can it be used for
> long-term retention "on tape"?

It can't.  I don't think it has been documented anywhere, but I believe that it 
has been well understood that if you don't trust your storage (tape, disk, 
floppies, punched cards, whatever), then you shouldn't trust your incremental 
streams on that storage.

It's as if the ZFS design assumed that all incremental streams would be either 
perfect or retryable.

This is a huge problem for tape retention, not so much for disk retention.

On a personal level I have handled this with a separate pool of fewer, larger 
and slower drives which serves solely as backup, taking incremental streams 
from the main pool every 20 minutes or so.

Unfortunately that approach breaks the legacy backup strategy of pretty much 
every company.

I think the message is that unless you can ensure the integrity of the stream, 
either backups should go to another pool or zfs send/receive should not be a 
critical part of the backup strategy.


Re: [zfs-discuss] zpool import hangs any zfs-related programs, eats all RAM and dies in swapping hell

2011-06-10 Thread Jim Klimov

2011-06-10 18:00, Steve Gonczi wrote:

Hi Jim,

I wonder what  OS version you are running?

There was a problem similar to what you are describing in earlier versions
in the 13x kernel series.

Should not be present in the 14x kernels.


It is OpenIndiana oi_148a, and unlike many other details -
this one was in my email post today ;)


I missed the system parameters in your earlier emails.


Other config info: one dual-core P4 @2.8Ghz or so, 8Gb RAM
(max for the motherboard), 6*2Tb Seagate ST2000DL003 disks
in raidz2 plus an old 80Gb disk for the OS and swap.

This system turned overnight from a test box to a backup
of some of my old-but-needed files, to their only storage
after the original server was cleaned. So I do want this
box to work reliably and not lose the data which is on it
already, and without dedup at 1.5x I'm running close to
not fitting in these 8Tb ;)


The usual  recommended solution is
"oh just do not use dedup, it is not production ready"


Well, alas, I am getting to the same conclusion so far...
I have never seen any remotely similar issues on any other
servers I maintain, but this one is the first guinea-pig
box for dedup. I guess for the next few years (until 128Gb
RAM or so would become the norm for a cheap home enthusiast
NAS) this one may be the last, too ;)


Using Dedup is more or less  hopeless with less than 16Gigs of memory.
(More is better)


Well... 8Gb RAM is not as low-end a configuration as some
others discussed in home-NAS-with-ZFS blogs which claimed
to have used dedup when it first came out. :)

Since there's little real information about DDT appetites
so far (this list is still buzzing with calculations and
tests), I assumed that 8Gb RAM is a reasonable amount for
starters. At least it would have been for, say, Linux or
other open-source enthusiast communities, which have to
make do with whatever crappy hardware they got second-hand ;)
I was in that camp, and I know that personal budgets do
not often assume more than 1-2k$ per box. Until recently
that included very little RAM. And as I said, this is the
maximum which can be put into my motherboard anyway.

Anyhow, this box has two pools at the moment:
* "pool" is the physical raidz2 on 6 disks with ashift=12
  Some datasets on pool are deduped and compressed (lzjb).
* "dcpool" is built in a compressed volume inside "pool",
  which is loopback-mounted over iSCSI and in the resulting
  disk I made another pool with deduped datasets.

This "dcpool" was used to test the idea about separating
compression and deduplication (so that dedup decisions
are made about raw source data, and after that whatever
has to be written is compressed - once).
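
For reference, a rough sketch of how such a layered setup can be put together
(names, sizes and the COMSTAR target/initiator plumbing are assumptions and
mostly elided here):

zfs create -V 4t -o compression=lzjb -o volblocksize=8k pool/dcvol
stmfadm create-lu /dev/zvol/rdsk/pool/dcvol    # export the zvol as a COMSTAR LU
# ... create a target, add a view, log in from the local iSCSI initiator;
# the LU then shows up as a new disk (placeholder name below), and:
zpool create -O dedup=on -O compression=off dcpool c9t600144F0XXXXXXXXd0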

The POC worked somewhat - except that I used small block
sizes in the "pool/dcpool" volume, so ZFS metadata to
address the volume blocks takes up as much as the userdata.

Performance was abysmal - around 1Mb/sec to write into
"dcpool" lately, and not much faster to read it, so for
the past month I have been trying to evacuate my data back from
"dcpool" into "pool", which performs faster - about 5-10Mb/s
during these copies between pools, and according to iostat,
the physical harddisks are quite busy (over 60%) while
"dcpool" is often stuck at 100% busy with several seconds(!)
of wait times and zero IO operations. The physical "pool"
datasets without dedup performed at "wirespeed" 40-50Mb/s
when I was copying files over CIFS from another computer
over a home gigabit LAN. Since this box is primarily an
archive storage with maybe a few datasets dedicated to
somewhat active data (updates to photo archive), slow
but infrequent IO to deduped data is okay with me --
as long as the system doesn't crash as it does now.

Partially, this is why I bumped up TXG Sync intervals
to 30 seconds - so that ZFS would have more data in
buffers after slow IO to coalesce and try to minimize
fragmentation and mechanical IOPS involved.


The issue was an incorrectly sized buffer that caused ZFS to be waiting too long
for a buffer allocation.  I can dig up the bug number and the fix description
if you are running something 130-ish.


Unless this fix was not integrated into OI for some reason,
I am afraid digging it up would be of limited help. Still,
I would be interested to read the summary, postmortem and
workarounds.

Maybe this was broken again in OI by newer "improvements"? ;)


The thing I would want to check is the sync times and frequencies.
You can dtrace (and timestamp) this.


Umm... could you please suggest a script, preferably one
that I can leave running on console and printing stats
every second or so?

I can only think of "time sync" so far. And I also think
it could be counted in a few seconds or more.
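
For what it's worth, a minimal sketch of such a script (assuming the fbt
provider can see spa_sync() and its argument types on this build) - it prints
one line per TXG sync with the pool name and the sync duration:

dtrace -qn '
fbt::spa_sync:entry
{
        self->t = timestamp;
        self->txg = args[1];
        self->pool = stringof(args[0]->spa_name);
}
fbt::spa_sync:return
/self->t/
{
        printf("%Y  pool %s  txg %d  synced in %d ms\n", walltimestamp,
            self->pool, self->txg, (timestamp - self->t) / 1000000);
        self->t = 0; self->txg = 0; self->pool = 0;
}'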



I would suspect when the bad  state occurs, your sync is taking a
_very_ long time.


When the VERY BAD state occurs, I can no longer use the
system or test anything ;)

When it nearly occurs, I have only a few seconds of uptime
left, and since each run boot-to-crash takes roughly 2-3
hours now, I 

Re: [zfs-discuss] ZFS receive checksum mismatch

2011-06-10 Thread David Magda
On Fri, June 10, 2011 07:47, Edward Ned Harvey wrote:

> #1  A single bit error causes checksum mismatch and then the whole data
> stream is not receivable.

I wonder if it would be worth adding a (toggleable?) forward error
correction (FEC) [1] scheme to the 'zfs send' stream.

Even if we're talking about a straight zfs send/recv pipe, and not saving
to a file, it'd be handy as you wouldn't have to restart a large transfer for
a single bit error (especially for those long initial syncs of remote
'mirrors').

[1] http://en.wikipedia.org/wiki/Forward_error_correction
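
A minimal sketch of what that could look like today, done outside of ZFS with
the par2(1) Reed-Solomon tool (tool choice, redundancy level and paths are
assumptions):

zfs send -R tank/fs@snap > /backup/fs.zfs
par2 create -r10 /backup/fs.zfs       # ~10% of recovery data stored alongside the stream
# after the backup medium flips some bits:
par2 repair /backup/fs.zfs.par2       # rebuilds fs.zfs, which can then be zfs-received as usual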




[zfs-discuss] ACARD ANS-9010 cannot work well with LSI 9211-8i SAS interface

2011-06-10 Thread Fred Liu
Just want to share this with you: we have found, and have been suffering from, some weird 
issues because of this combination. It is better to connect them directly to the SATA ports 
on the mainboard.


Thanks.

Fred


Re: [zfs-discuss] zpool import hangs any zfs-related programs, eats all RAM and dies in swapping hell

2011-06-10 Thread Jim Klimov

2011-06-10 13:51, Jim Klimov wrote:

and the system dies in
swapping hell (scanrates for available pages were seen to go
into millions, CPU context switches reach 200-300k/sec on a
single dualcore P4) after eating the last stable-free 1-2Gb
of RAM within a minute. After this the system responds to
nothing except the reset button.



I've captured an illustration for this today, with my watchdog as
well as vmstat, top and other tools. Half a gigabyte in under one
second - the watchdog never saw it coming :(

My "freeram-watchdog" is based on vmstat but emphasizes
deltas in "freeswap" and "freeram" values (see middle columns)
and has fewer fields, with more readable names ;)

freq freeswap freeram scanrate Dswap Dram in sy cs us sy id
1 6652236 497088 0 0 -5380 3428 3645 3645 0 83 17
1 6652236 502112 0 0 5024 2332 2962 2962 1 72 27
1 6652236 494656 0 0 -7456 2886 3641 3641 0 78 21
1 6652236 502024 0 0 7368 3748 4197 4197 1 83 16
1 6652236 502316 0 0 292 4090 2516 2516 0 68 32
1 6652236 498388 0 0 -3928 2270 3940 3940 1 76 24
1 6652236 502264 0 0 3876 3097 3097 3097 0 76 23
1 6652236 495052 0 0 -7212 2705 2796 2796 1 86 14
1 6652236 502384 0 0 7332 3609 4449 4449 1 81 18
1 6652236 502292 0 0 -92 3639 2639 2639 1 80 19
1 6652236 92064 3435680 0 -410228 15665 1312 1312 0 99 0

In VMSTAT it looked like this:
kthr memory page disk faults cpu
r b w swap free re mf pi po fr de sr f0 s2 s3 s4 in sy cs us sy id
0 0 0 6652236 495052 0 32 0 0 0 0 0 0 0 0 0 3037 4107 158598 0 90 10
0 0 0 6652236 502384 0 24 0 0 0 0 0 0 0 0 0 3266 2697 114195 1 78 21
6 0 0 6652236 502292 0 23 0 0 0 0 0 0 15 15 15 2947 3048 130070 0 87 13
29 0 0 6652236 92064 124 155 0 0 5084 0 3706374 0 0 0 0 16743 1244 2696 0 100 0


So, for a couple of minutes before the freeze, the system was rather
stable at around 500Mb free RAM. Before that it was stable at 1-1.2Gb,
but then jumped down to about 500Mb in roughly 10 seconds.

And in the last second of known uptime, the system ate up at least
400Mb and began scanning for free pages at 3.7Mil scans/sec.
Usually reaching this condition takes about 3-5 seconds, and
I see "cs" going up to about 200k, and my watchdog has time
to reboot the system. Not this time :(

According to TOP, the free RAM dropped down to 32Mb (which in
my older adventures was also the empirical lower limit of RAM when
the system began scanrating to death), with zpool process ranking
high - but I haven't yet seen pageout make it to top in its last second
of life:

last pid: 1786; load avg: 3.59, 2.10, 1.09; up 0+02:20:28 15:07:20
118 processes: 100 sleeping, 16 running, 2 on cpu
CPU states: 0.0% idle, 0.4% user, 99.6% kernel, 0.0% iowait, 0.0% swap
Kernel: 2807 ctxsw, 210 trap, 16730 intr, 1388 syscall, 161 flt
Memory: 8191M phys mem, 32M free mem, 6655M total swap, 6655M free swap

PID USERNAME NLWP PRI NICE SIZE RES STATE TIME CPU COMMAND
1464 root 138 99 -20 0K 0K sleep 2:01 30.93% zpool-dcpool
2 root 2 97 -20 0K 0K cpu/1 0:03 26.67% pageout
1220 root 1 59 0 4400K 2188K sleep 1:04 0.57% prstat
1477 root 1 59 0 2588K 1756K run 3:13 0.28% freeram-watchdo
3 root 1 60 -20 0K 0K sleep 0:17 0.21% fsflush
522 root 1 59 0 4172K 1000K run 0:35 0.20% top


One way or another, such repeatable failure behaviour is simply
not acceptable for a production storage platform :( and I hope
to see it fixed - if I can help somehow?..

I *think* one way to reproduce it would be:
1) Enable dedup (optional?)
2) Write lots of data to disk, e.g. 2-3Tb
3) Delete lots of data, or make and destroy a snapshot,
or destroy a dataset with test data

This puts the system into position with lots of processing of
(not-yet-)deferred deletes.

In my case this by itself often leads to RAM starvation and
hangs and a following reset of the box; you can reset the
TEST system during such delete processing.

Now when you reboot and try to import this test pool, you
should have a situation like mine - the pool does not import
quickly, zfs-related commands hang, and in a few hours
the box should die ;)
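
A minimal sketch of that recipe (device names, sizes and dataset names are
assumptions):

zpool create testpool raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0
zfs create -o dedup=on -o compression=lzjb testpool/junk
dd if=/dev/urandom of=/testpool/junk/blob bs=1024k count=2000000   # ~2Tb of unique (DDT-heavy) data
zfs destroy -r testpool/junk      # queues a huge amount of frees to process
# reset the box while those frees are still being processed, then:
zpool import testpool             # this import is where the RAM exhaustion shows up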

iostat reports many small reads and occasional writes
(starting after about 10 minutes into import) which gives
me hope that the pool will come back online sometime...


The current version of my software watchdog which saves some
trouble for my assistant by catching near-freeze conditions,
is here:

* http://thumper.cos.ru/~jim/freeram-watchdog-20110610-v0.11.tgz




--
Климов Евгений / Jim Klimov
Technical director (CTO), JSC COS&HT (ЗАО "ЦОС и ВТ")
+7-903-7705859 (cellular)   mailto:jimkli...@cos.ru
CC: ad...@cos.ru, jimkli...@mail.ru

Re: [zfs-discuss] ZFS receive checksum mismatch

2011-06-10 Thread Jim Klimov

2011-06-10 15:58, Darren J Moffat wrote:


As I pointed out last time this came up the NDMP service on Solaris 11 
Express and on the Oracle ZFS Storage Appliance uses the 'zfs send' 
stream as what is to be stored on the "tape".




This discussion turns interesting ;)

Just curious: how do these products work around the stream fragility
which we are discussing here - that a single-bit error can/will/should
make the whole zfs send stream invalid, even though it is probably
an error localized in a single block? This block is ultimately related
to a file (or a few files in case of dedup or snapshots/clones) whose
name "zfs recv" could report for an admin to take action such as rsync.

If it is true that unlike ZFS itself, the replication stream format has
no redundancy (even of ECC/CRC sort), how can it be used for
long-term retention "on tape"?

I understand about online transfers, somewhat. If the transfer failed,
you still have the original to retry. But backups are often needed when
the original is no longer alive, and that's why they are needed ;)

And by Murphy's law that's when this single bit strikes ;)

Is such "tape" storage only intended for reliable media such as
another ZFS or triple-redundancy tape archive with fancy robotics?
How would it cope with BER in transfers to/from such media?

Also, an argument was recently posed (when I wrote of saving
zfs send streams into files and transferring them by rsync over
slow bad links), that for most online transfers I would be better off
using zfs send of incremental snapshots. While I agree with this in terms
that an incremental transfer is presumably smaller and has less
chance of corruption (network failure) during transfer than a huge
initial stream, this chance of corruption is still non-zero. Simply
in case of online transfers I can detect the error and retry at low
cost (or big cost - bandwidth is not free in many parts of the world).

Going back to storing many streams (initial + increments) on tape -
if an intermediate incremental stream has a single-bit error, then
its snapshot and any that follow it cannot be received into zfs.
Even if the "broken" block is later freed and discarded (equivalent
to overwriting with a newer version of a file from a newer increment
in classic backup systems with a file being the unit of backup).
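
A hedged illustration of that dependency chain (names are assumptions):

zfs send -R tank/fs@full          > /tape/full.zfs
zfs send -R -i @full tank/fs@mon  > /tape/incr-mon.zfs
zfs send -R -i @mon  tank/fs@tue  > /tape/incr-tue.zfs
# one flipped bit in incr-mon.zfs makes BOTH incr-mon.zfs and incr-tue.zfs
# unreceivable, because @tue can only be received on top of an intact @mon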

And since the total size of initial+incremental backups is likely
larger than that of a single full dump, the chance of a single corruption
making your (latest) backup useless would also be higher, right?

Thanks for clarifications,
//Jim Klimov



Re: [zfs-discuss] ZFS receive checksum mismatch

2011-06-10 Thread Darren J Moffat

On 06/10/11 12:47, Edward Ned Harvey wrote:

From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
boun...@opensolaris.org] On Behalf Of Jonathan Walker

New to ZFS, I made a critical error when migrating data and
configuring zpools according to needs - I stored a snapshot stream to
a file using "zfs send -R [filesystem]@[snapshot]>[stream_file]".


There are precisely two reasons why it's not recommended to store a zfs send
datastream for later use.  As long as you can acknowledge and accept these
limitations, then sure, go right ahead and store it.  ;-)  A lot of people
do, and it's good.


Not recommended by who ?  Which documentation says this ?

As I pointed out last time this came up the NDMP service on Solaris 11 
Express and on the Oracle ZFS Storage Appliance uses the 'zfs send' 
stream as what is to be stored on the "tape".


--
Darren J Moffat


Re: [zfs-discuss] ZFS receive checksum mismatch

2011-06-10 Thread Edward Ned Harvey
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Jonathan Walker
> 
> New to ZFS, I made a critical error when migrating data and
> configuring zpools according to needs - I stored a snapshot stream to
> a file using "zfs send -R [filesystem]@[snapshot] >[stream_file]".

There are precisely two reasons why it's not recommended to store a zfs send
datastream for later use.  As long as you can acknowledge and accept these
limitations, then sure, go right ahead and store it.  ;-)  A lot of people
do, and it's good.

#1  A single bit error causes checksum mismatch and then the whole data
stream is not receivable.  Obviously you encountered this problem already,
and you were able to work around.  If I were you, however, I would be
skeptical about data integrity on your system.  You said you scrubbed and
corrected a couple of errors, but that's not actually possible.  The
filesystem integrity checksums are for detection, not correction, of
corruption.  The only way corruption gets corrected is when there's a
redundant copy of the data...  Then ZFS can discard the corrupt copy,
overwrite with a good copy, and all the checksums suddenly match.  Of course
there is no such thing in the zfs send data stream - no redundant copy in
the data stream.  So yes, you have corruption.  The best you can possibly do
is to identify where it is, and then remove the affected files.

#2  You cannot do a partial receive, nor generate a catalog of the files
within the datastream.  You can restore the whole filesystem or nothing.
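
(A hedged illustration of that "all or nothing" point: to get at individual
files you have to receive the whole stream somewhere first - pool, device and
path names below are assumptions:)

zpool create scratch c2t0d0
zfs receive -dF scratch < /backup/stream.zfs
# the stream's datasets now appear mounted under /scratch; copy out what you need:
cp -p /scratch/somefs/wanted-file /somewhere/safe/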



Re: [zfs-discuss] ZFS receive checksum mismatch

2011-06-10 Thread Edward Ned Harvey
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Jim Klimov
> 
> Besides, the format
> is not public and subject to change, I think. So future compatibility
> is not guaranteed.

That is not correct.  

Years ago, there was a comment in the man page that said this:  "The format
of the stream is evolving. No backwards  compatibility is guaranteed. You
may not be able to receive your streams on future versions of ZFS."

But in the last several years, backward/forward compatibility has always
been preserved, so despite the warning, it was never a problem.

In more recent versions, the man page says:  "The format of the stream is
committed. You will be able to receive your streams on future versions of
ZFS."



[zfs-discuss] zpool import hangs any zfs-related programs, eats all RAM and dies in swapping hell

2011-06-10 Thread Jim Klimov

The subject says it all, more or less: due to some problems
with a pool (i.e. deferred deletes a month ago, possibly
similar now), the "zpool import" hangs any zfs-related
programs, including "zfs", "zpool", "bootadm", sometimes "df".

After several hours of disk-thrashing all 8Gb of RAM in the
system is consumed (by kernel I guess, because "prstat" and
"top" don't show any huge processes) and the system dies in
swapping hell (scanrates for available pages were seen to go
into millions, CPU context switches reach 200-300k/sec on a
single dualcore P4) after eating the last stable-free 1-2Gb
of RAM within a minute. After this the system responds to
nothing except the reset button.

ZDB walks were seen to take up over 20Gb of VM, but as the
ZDB is a userland process - it could swap. I guess that the
kernel is doing something similar in appetite - but can't
swap out the kernel memory.
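
(For what it's worth, a quick way to see where kernel memory is going on a
live box - standard mdb/kstat invocations, shown here as an aside:)

echo ::memstat | mdb -k        # per-type page breakdown: Kernel, ZFS File Data, Anon, Free, ...
kstat -p zfs:0:arcstats:size zfs:0:arcstats:arc_meta_used   # ARC and ARC-metadata footprint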

So regarding the hanging ZFS-related programs, I think
there's some bad locking involved (i.e. I should be able
to see or configure other pools besides the one being imported?),
and regarding the VM depletion without swapping - that seems
like a kernel problem.

Part of the problem is that the box is on "remote support" (while
it is a home NAS, I am away from home for months - so my
neighbor assists by walking in to push reset). While I was
troubleshooting the problem I wrote a watchdog program
based on vmstat, which catches bad conditions and calls
uadmin(2) to force an ungraceful software reboot. Quite
often it does not have enough time to react, though - 1-second
strobes into kernel VM stats are a very long period :(
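
A stripped-down sketch of that kind of watchdog (the thresholds, the console
device and the exact vmstat column layout are assumptions, not what
freeram-watchdog actually uses):

#!/bin/sh
# reboot the box before the VM scan-rate death spiral completes
vmstat 1 | while read r b w swap free re mf pi po fr de sr d0 d1 d2 d3 in sy cs us sys id; do
        case "$free" in ''|*[!0-9]*) continue;; esac    # skip vmstat header lines
        if [ "$free" -lt 65536 ] || [ "$sr" -gt 1000000 ]; then
                echo "watchdog: free=${free}K sr=${sr}/s - rebooting" > /dev/console
                uadmin 1 1      # A_REBOOT/AD_BOOT via uadmin(2): immediate, ungraceful reboot
        fi
done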

The least I can say is that this is very annoying, to the
point that I am not sure what variant of Solaris to build
my customers' and friends' NASes with. This box is currently
on OI_148a with the updated ZFS package from Mar 2011, and
while I am away I am not sure I can safely remotely update
this box.

Actually I wrote about this situation in detail on the forums,
but that was before web-posts were forwarded to email so I
never got any feedback. There's a lot of detailed text in
these threads, so I won't go over all of it again now:
* http://opensolaris.org/jive/thread.jspa?threadID=138604&tstart=0
* http://opensolaris.org/jive/thread.jspa?threadID=138740&tstart=0


Back then it took about a week of reboots for the "pool" to finally get
imported, with no visible progress-tracker except running ZDB to see
that the deferred-free list was decreasing, and wondering if maybe it was
the culprit (in the end, it was). I was also lucky
that this ZFS cleanup of deferred-free blocks was cumulative and
the gained progress survived over reboots. Currently I have little
idea what the problem is with my "dcpool" (it lives in a volume in
"pool" and mounts over iSCSI) - ZDB has not finished yet, and two
days of reboots every 3 hours did not fix the problem; the "dcpool"
still does not import.

Since my box's OS is OpenIndiana, I filed a few bugs to track
these problems as well, with little activity from other posters:
* https://www.illumos.org/issues/841
* https://www.illumos.org/issues/956

The current version of my software watchdog which saves some
trouble for my assistant by catching near-freeze conditions,
is here:

* http://thumper.cos.ru/~jim/freeram-watchdog-20110610-v0.11.tgz

I guess it is time for questions now :)

What methods can I use (besides 20-hour-long ZDB walks) to
gain quick insight into the cause of the problems - why doesn't
the pool import quickly? Does it make and keep any progress
while trying to import over numerous reboots? How much is left?

Are there any tunables I have not tried yet? Currently I have
the following settings to remedy different performance and
stability problems of this box:

# cat /etc/system | egrep -v '\*|^$'
set zfs:aok = 1
set zfs:zfs_recover = 1
set zfs:zfs_resilver_delay = 0
set zfs:zfs_resilver_min_time_ms = 2
set zfs:zfs_scrub_delay = 0
set zfs:zfs_arc_max=0x1a000
set zfs:arc_meta_limit = 0x18000
set zfs:zfs_arc_meta_limit = 0x18000
set zfs:metaslab_min_alloc_size = 0x8000
set zfs:metaslab_smo_bonus_pct = 0xc8
set zfs:zfs_write_limit_override = 0x1800
set zfs:zfs_txg_timeout = 30
set zfs:zfs_txg_synctime = 30
set zfs:zfs_vdev_max_pending = 5


Are my guesses that "this is a kernel problem" at all correct?

Did any related fixes by chance make their way into the development
versions of newer OpenIndianas (148b, 151, pkg-dev repository)
already?

Thanks for any comments, condolences, insights, bugfixes ;)
//Jim Klimov

