Re: [zfs-discuss] Incremental backup via zfs send / zfs receive

2009-09-20 Thread Peter Pickford
just destroy the swap snapshot and it doesn't get sent when you do a full send
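
For example, using the pool name and "backup" alias from the script quoted below (a rough sketch, untested here; the idea is that once the swap snapshot is gone, the recursive send has nothing to send for that dataset):

T=`date "+%Y-%m-%d:%H:%M:%S"`
zfs snapshot -r space@$T
zfs destroy space/swap@$T              # drop the swap snapshot before sending
zfs send -R space@$T | ssh backup zfs recv -vF space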

2009/9/20 Frank Middleton :
> A while back I posted a script that does individual send/recvs
> for each file system, sending incremental streams if the remote
> file system exists, and regular streams if not.
>
> The reason for doing it this way rather than a full recursive
> stream is that there's no way to avoid sending certain file
> systems such as swap, and it would be nice not to always send
> certain properties such as mountpoint, and there might be file
> systems you want to keep on the receiving end.
>
> The problem with the regular stream is that most of the file
> system properties (such as mountpoint) are not copied as they
> are with a recursive stream. This may seem an advantage to some,
> (e.g., if the remote mountpoint is already in use, the mountpoint
> seems to default to legacy). However, did I miss anything in the
> documentation, or would it be worth submitting an RFE for an
> option to send/recv properties in a non-recursive stream?
>
> Oddly, incremental non-recursive streams do seem to override
> properties, such as mountpoint, hence the /opt problem. Am I
> missing something, or is this really an inconsistency? IMO
> non-recursive regular and incremental streams should behave the
> same way and both have options to send or not send properties.
> For my purposes the default behavior is reversed from what I
> would like to do...
>
> Thanks -- Frank
>
> Latest version of the  script follows; suggestions for improvements
> most welcome, especially the /opt problem where source and destination
> hosts have different /opts (host6-opt and host5-opt here) - see
> ugly hack below (/opt is on the data pool because the boot disks
> - soon to be SSDs - are filling up):
>
> #!/bin/bash
> #
> # backup is the alias for the host receiving the stream
> # To start, do a full recursive send/receive and put the
> # name of the initial snapshot in cur_snap. In case of
> # disasters, the older snap name is saved in cur_snap_prev
> # and there's an option not to delete any snapshots when done.
> #
> if test ! -e cur_snap; then echo cur_snap not found; exit; fi
> P=`cat cur_snap`
> mv -f cur_snap cur_snap_prev
> T=`date "+%Y-%m-%d:%H:%M:%S"`
> echo $T > cur_snap
> echo snapping to space@$T
> echo Starting backup from space@$P to space@$T at `date` >> snap_time
> zfs snapshot -r space@$T
> echo snapshot done
> for FS in `zfs list -H | cut -f 1`
> do
> RFS=`ssh backup "zfs list -H $FS 2>/dev/null" | cut  -f 1`
> case $FS in
> "space/")
>  echo skipping $FS
>  ;;
> *)
>  if test "$RFS"; then
>    if [ "$FS" = "space/swap" ]; then
>      echo skipping $FS
>    else
>      echo do zfs send -i $FS@$P $FS@$T I ssh backup zfs recv -vF $RFS
>              zfs send -i $FS@$P $FS@$T | ssh backup zfs recv -vF $RFS
>    fi
>  else
>    echo do zfs send $FS@$T I ssh backup zfs recv -v $FS
>            zfs send $FS@$T | ssh backup zfs recv -v $FS
>  fi
>  if [ "$FS" = "space/host5-opt" ]; then
>  echo do ssh backup zfs set mountpoint=legacy space/host5-opt
>          ssh backup zfs set mountpoint=legacy space/host5-opt
>  fi
>  ;;
> esac
> done
>
> echo --Ending backup from space@$P to space@$T at `date` >> snap_time
>
> DOIT=1
> while [ $DOIT -eq 1 ]
> do
>  read -p "Delete old snapshot  " REPLY
>  REPLY=`echo $REPLY | tr '[:upper:]' '[:lower:]'`
>  case $REPLY in
>    "y")
>      ssh backup "zfs destroy -r space@$P"
>      echo Remote space@$P destroyed
>      zfs destroy -r space@$P
>      echo Local space@$P destroyed
>      DOIT=0
>      ;;
>    "n")
>      echo Skipping:
>      echo "   "ssh backup "zfs destroy -r sp...@$p"
>      echo "   "zfs destroy -r sp...@$p
>      DOIT=0
>      ;;
>     *)
>      echo "Please enter y or n"
>      ;;
>  esac
> done
>
>
>


Re: [zfs-discuss] Real help

2009-09-20 Thread Bogdan M. Maryniuk
On Mon, Sep 21, 2009 at 3:41 AM, vattini giacomo  wrote:
> sudo zpool destroy hazz0
> sudo reboot
> Now OpenSolaris is not booting; everything has vanished

ROFL

This actually has to go to the daily WTF... :-)

-- 
Kind regards, BM

Things, that are stupid at the beginning, rarely ends up wisely.


Re: [zfs-discuss] ZFS file disk usage

2009-09-20 Thread Richard Elling

If you are just building a cache, why not just make a file system and
put a reservation on it? Turn off auto snapshots and set other features
as per best practices for your workload? In other words, treat it like we
treat dump space.
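
For example, something along these lines (dataset name and sizes are made up, and com.sun:auto-snapshot assumes the Time Slider auto-snapshot service):

zfs create rpool/afscache
zfs set reservation=8G rpool/afscache        # guarantee the cache its space
zfs set quota=8G rpool/afscache              # and cap it at that size
zfs set com.sun:auto-snapshot=false rpool/afscache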

I think that we are getting caught up in trying to answer the question
you ask rather than solving the problem you have... perhaps because
we don't understand the problem.
 -- richard

On Sep 20, 2009, at 2:17 PM, Andrew Deason wrote:


On Fri, 18 Sep 2009 17:54:41 -0400
Robert Milkowski  wrote:


There will be a delay of up to 30s currently.

But how much data do you expect to be pushed within 30s?
Let's say it would be even 10g of lots of small files and you would
calculate the total size by only summing up the logical size of data.
Would you really expect that the error would be greater than 5%, which
would be 500mb? Does it matter in practice?


Well, that wasn't the problem I was thinking of. I meant, if we have to
wait 30 seconds after the write to measure the disk usage... what do I
do, just sleep 30s after the write before polling for disk usage?

We could just ask for disk usage when we write, knowing that it doesn't
take into account the write we are performing... but we're changing what
we're measuring, then. If we are removing things from the cache in order
to free up space, how do we know when to stop?

To illustrate: normally when the cache is 98% full, we remove items
until we are 95% full before we allow a write to happen again. If we
relied on statvfs information for our disk usage information, we would
start removing items at 98%, and have no idea when we hit 95% unless we
wait 30 seconds.

If you are simply saying that the difference in logical size and used
disk blocks on ZFS are similar enough not to make a difference... well,
that's what I've been asking. I have asked what the maximum difference
is between "logical size rounded up to recordsize" and "size taken up on
disk", and haven't received an answer yet. If the answer is "small
enough that you don't care", then fantastic.


what if a user enables compression like lzjb or even gzip?
How would you like to take it into account before doing writes?

What if a user creates a snapshot? How would you take it into account?


Then it will be wrong; we do not take them into account. I do not care
about those cases. It is already impossible to enforce that the cache
tracking data is 100% correct all of the time.

Imagine we somehow had a way to account for all of those cases you
listed, and would make me happy. Say the directory the user uses for the
cache data is /usr/vice/cache (one standard path to put it). The OpenAFS
client will put cache data in e.g. /usr/vice/cache/D0/V1 and a bunch of
other files.  If the user puts their own file in
/usr/vice/cache/reallybigfile, our cache tracking information will
always be off, in all current implementations.  We have no control over
it, and we do not try to solve that problem.

I am treating the cases of "what if the user creates a snapshot" and the
like as a similar situation. If someone does that and runs out of space,
it is pretty easy to troubleshoot their system and say "you have a
snapshot of the cache dataset; do not do that". Right now, if someone
runs an OpenAFS client cache on zfs and runs out of space, the only
thing I can tell them is "don't use zfs", which I don't want to do.

If it works for _a_ configuration -- the default one -- that is all I am
asking for.


I suspect that you are looking too closely for no real benefit.
Especially if you don't want to dedicate a dataset to the cache, you
would expect other applications in the system to write to the same file
system in different locations, with no way to control or predict how
much data will be written at all. Be it Linux, Solaris, BSD, ... the
issue will be there.


It is certainly possible for other applications to fill up the disk. We
just need to ensure that we don't fill up the disk to block other
applications. You may think this is fruitless, and just from that
description alone, it may be. But you must understand that without an
accurate bound on the cache, well... we can eat up the disk a lot faster
than other applications without the user realizing it.

--
Andrew Deason
adea...@sinenomine.net


Re: [zfs-discuss] Incremental backup via zfs send / zfs receive

2009-09-20 Thread Frank Middleton

A while back I posted a script that does individual send/recvs
for each file system, sending incremental streams if the remote
file system exists, and regular streams if not.

The reason for doing it this way rather than a full recursive
stream is that there's no way to avoid sending certain file
systems such as swap, and it would be nice not to always send
certain properties such as mountpoint, and there might be file
systems you want to keep on the receiving end.

The problem with the regular stream is that most of the file
system properties (such as mountpoint) are not copied as they
are with a recursive stream. This may seem an advantage to some,
(e.g., if the remote mountpoint is already in use, the mountpoint
seems to default to legacy). However, did I miss anything in the
documentation, or would it be worth submitting an RFE for an
option to send/recv properties in a non-recursive stream?
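
A crude workaround in the meantime is to re-apply the locally-set properties by hand after a regular receive -- a sketch, using the host6-opt dataset mentioned below, and assuming property values without embedded spaces:

FS=space/host6-opt
zfs get -H -s local -o property,value all $FS |
while read PROP VALUE
do
    # -n keeps ssh from swallowing the rest of the property list on stdin
    ssh -n backup zfs set $PROP=$VALUE $FS
done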

Oddly, incremental non-recursive streams do seem to override
properties, such as mountpoint, hence the /opt problem. Am I
missing something, or is this really an inconsistency? IMO
non-recursive regular and incremental streams should behave the
same way and both have options to send or not send properties.
For my purposes the default behavior is reversed from what I
would like to do...

Thanks -- Frank

Latest version of the  script follows; suggestions for improvements
most welcome, especially the /opt problem where source and destination
hosts have different /opts (host6-opt and host5-opt here) - see
ugly hack below (/opt is on the data pool because the boot disks
- soon to be SSDs - are filling up):

#!/bin/bash
#
# backup is the alias for the host receiving the stream
# To start, do a full recursive send/receive and put the
# name of the initial snapshot in cur_snap. In case of
# disasters, the older snap name is saved in cur_snap_prev
# and there's an option not to delete any snapshots when done.
#
if test ! -e cur_snap; then echo cur_snap not found; exit; fi
P=`cat cur_snap`
mv -f cur_snap cur_snap_prev
T=`date "+%Y-%m-%d:%H:%M:%S"`
echo $T > cur_snap
echo snapping to space@$T
echo Starting backup from space@$P to space@$T at `date` >> snap_time
zfs snapshot -r space@$T
echo snapshot done
for FS in `zfs list -H | cut -f 1`
do
RFS=`ssh backup "zfs list -H $FS 2>/dev/null" | cut  -f 1`
case $FS in
"space/")
  echo skipping $FS
  ;;
*)
  if test "$RFS"; then
if [ "$FS" = "space/swap" ]; then
  echo skipping $FS
else
  echo do zfs send -i $FS@$P $FS@$T I ssh backup zfs recv -vF $RFS
  zfs send -i $FS@$P $FS@$T | ssh backup zfs recv -vF $RFS
fi
  else
echo do zfs send $FS@$T I ssh backup zfs recv -v $FS
zfs send $FS@$T | ssh backup zfs recv -v $FS
  fi
  if [ "$FS" = "space/host5-opt" ]; then
  echo do ssh backup zfs set mountpoint=legacy space/host5-opt
  ssh backup zfs set mountpoint=legacy space/host5-opt
  fi
  ;;
esac
done

echo --Ending backup from space@$P to space@$T at `date` >> snap_time

DOIT=1
while [ $DOIT -eq 1 ]
do
  read -p "Delete old snapshot  " REPLY
  REPLY=`echo $REPLY | tr '[:upper:]' '[:lower:]'`
  case $REPLY in
"y")
  ssh backup "zfs destroy -r sp...@$p"
  echo Remote sp...@$p destroyed
  zfs destroy -r sp...@$p
  echo Local sp...@$p destroyed
  DOIT=0
  ;;
"n")
  echo Skipping:
  echo "   "ssh backup "zfs destroy -r sp...@$p"
  echo "   "zfs destroy -r sp...@$p
  DOIT=0
  ;;
 *)
  echo "Please enter y or n"
  ;;
  esac
done





Re: [zfs-discuss] ZFS file disk usage

2009-09-20 Thread Andrew Deason
On Fri, 18 Sep 2009 17:54:41 -0400
Robert Milkowski  wrote:

> There will be a delay of up to 30s currently.
> 
> But how much data do you expect to be pushed within 30s?
> Let's say it would be even 10g of lots of small files and you would
> calculate the total size by only summing up the logical size of data.
> Would you really expect that the error would be greater than 5%, which
> would be 500mb? Does it matter in practice?

Well, that wasn't the problem I was thinking of. I meant, if we have to
wait 30 seconds after the write to measure the disk usage... what do I
do, just sleep 30s after the write before polling for disk usage?

We could just ask for disk usage when we write, knowing that it doesn't
take into account the write we are performing... but we're changing what
we're measuring, then. If we are removing things from the cache in order
to free up space, how do we know when to stop?

To illustrate: normally when the cache is 98% full, we remove items
until we are 95% full before we allow a write to happen again. If we
relied on statvfs information for our disk usage information, we would
start removing items at 98%, and have no idea when we hit 95% unless we
wait 30 seconds.
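
Purely to illustrate that watermark logic against statvfs-style numbers, a shell sketch (the eviction step is a hypothetical placeholder; on ZFS the reported usage can lag the writes by up to ~30 seconds, which is exactly the problem):

CACHE=/usr/vice/cache
HIGH=98
LOW=95
used_pct() { df -k $CACHE | awk 'NR==2 { sub(/%/, "", $5); print $5 }'; }
if [ `used_pct` -ge $HIGH ]; then
    while [ `used_pct` -gt $LOW ]
    do
        evict_one_chunk          # hypothetical: remove one cache file
    done
fi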

If you are simply saying that the difference in logical size and used
disk blocks on ZFS are similar enough not to make a difference... well,
that's what I've been asking. I have asked what the maximum difference
is between "logical size rounded up to recordsize" and "size taken up on
disk", and haven't received an answer yet. If the answer is "small
enough that you don't care", then fantastic.
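
One way to see that gap empirically for a given cache file is simply to compare the logical size with what is actually allocated, e.g.:

ls -l /usr/vice/cache/D0/V1      # logical file size
du -k /usr/vice/cache/D0/V1      # kilobytes actually allocated on ZFS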

> what if a user enables compression like lzjb or even gzip?
> How would you like to take it into account before doing writes?
> 
> What if a user creates a snapshot? How would you take it into account?

Then it will be wrong; we do not take them into account. I do not care
about those cases. It is already impossible to enforce that the cache
tracking data is 100% correct all of the time.

Imagine we somehow had a way to account for all of those cases you
listed, and would make me happy. Say the directory the user uses for the
cache data is /usr/vice/cache (one standard path to put it). The OpenAFS
client will put cache data in e.g. /usr/vice/cache/D0/V1 and a bunch of
other files.  If the user puts their own file in
/usr/vice/cache/reallybigfile, our cache tracking information will
always be off, in all current implementations.  We have no control over
it, and we do not try to solve that problem.

I am treating the cases of "what if the user creates a snapshot" and the
like as a similar situation. If someone does that and runs out of space,
it is pretty easy to troubleshoot their system and say "you have a
snapshot of the cache dataset; do not do that". Right now, if someone
runs an OpenAFS client cache on zfs and runs out of space, the only
thing I can tell them is "don't use zfs", which I don't want to do.

If it works for _a_ configuration -- the default one -- that is all I am
asking for.

> I suspect that you are looking too closely for no real benefit.
> Especially if you don't want to dedicate a dataset to the cache, you
> would expect other applications in the system to write to the same file
> system in different locations, with no way to control or predict how
> much data will be written at all. Be it Linux, Solaris, BSD, ... the
> issue will be there.

It is certainly possible for other applications to fill up the disk. We
just need to ensure that we don't fill up the disk to block other
applications. You may think this is fruitless, and just from that
description alone, it may be. But you must understand that without an
accurate bound on the cache, well... we can eat up the disk a lot faster
than other applications without the user realizing it.

-- 
Andrew Deason
adea...@sinenomine.net


Re: [zfs-discuss] Real help

2009-09-20 Thread vattini giacomo
Under the Ubuntu system I've done a zpool import -D, but no luck
-- 
This message posted from opensolaris.org


[zfs-discuss] SOLVED: Re: migrating from linux to solaris ZFS

2009-09-20 Thread Paul Archer

Thursday, Paul Archer wrote:


Tomorrow, Fajar A. Nugraha wrote:


There was a post from Ricardo on zfs-fuse list some time ago.
Apparently if you do a "zpool create" on whole disks, Linux and
Solaris behave differently:
- Solaris will create an EFI partition on that disk, and use the partition as
a vdev

- Linux will use the whole disk without any partition, just like with
a file-based vdev.

The result is that you might be unable to import the pool on *solaris or 
*BSD.


The recommended way to create a "portable" pool is to create the pool
on a partition setup recognizable on all those OS. He suggested a
simple DOS/MBR partition table.

So in short, if you had created the pool on top of sda1 instead of
sda, it will work. I'm surprised though that you can "offlined sda and
replaced it with sda1" when previously you said "I see that if I try
to replace sda with sda1, zpool complains that sda1 is too small"



I was a bit surprised about that, too. But I found that a standard PC/Linux 
partition reserves around 24MB at the beginning of the disk, and an EFI (or 
actually, GPT) disklabel and partition only uses a few hundred KB.




As I mentioned above, I created GPT disklabels and partitions on all my 
disks, then one-by-one offlined the disk and replaced it with the 
partition from the same disk (eg 'zpool replace datapool ad1 ad1p1').
I did the first replacement with Linux and zfs-fuse. The resilver took 32 
hours. I did the rest in FreeBSD, which took 5-6 hours for each disk.
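
For reference, the per-disk sequence amounts to something like this (device names as in the example above; each resilver has to finish before the next disk is touched):

zpool offline datapool ad1
zpool replace datapool ad1 ad1p1
zpool status datapool            # wait here until the resilver completes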


It was tedious, but the pool is available in Solaris (finally!), so 
hopefully no more NFS issues or kernel panics. (I had NFS issues with 
both Linux and BSD, and kernel panics with BSD.)


Paul

PS. Complicating matters was the fact that for some reason, BSD didn't 
like my LSI 150-6 SATA card (which is the only one Solaris plays nice 
with), so I had to keep switching cards every time I went from one OS to 
the other. Blech. OTOH, here's to Live CDs!



Re: [zfs-discuss] Real help

2009-09-20 Thread dick hoogendijk

On Sun, 2009-09-20 at 11:41 -0700, vattini giacomo wrote:
> Hi there, I'm in a bad situation. Under Ubuntu I was trying to import a Solaris
> zpool that is in /dev/sda1, while Ubuntu is in /dev/sda5; not being able to
> mount the Solaris pool, I decided to destroy the pool I had created like this:
> sudo zfs-fuse
> sudo zpool  create hazz0 /dev/sda1
> sudo zpool destroy hazz0
> sudo reboot
> Now OpenSolaris is not booting; everything has vanished.
> Is there any way to restore everything?

Any idea about the meaning of the verb DESTROY?

-- 
Dick Hoogendijk -- PGP/GnuPG key: 01D2433D
+ http://nagual.nl/ | SunOS 10u7 5/09 | OpenSolaris 2010.02 b123
+ All that's really worth doing is what we do for others (Lewis Carrol)



[zfs-discuss] Real help

2009-09-20 Thread vattini giacomo
Hi there, I'm in a bad situation. Under Ubuntu I was trying to import a Solaris
zpool that is in /dev/sda1, while Ubuntu is in /dev/sda5; not being able to
mount the Solaris pool, I decided to destroy the pool I had created like this:
sudo zfs-fuse
sudo zpool  create hazz0 /dev/sda1
sudo zpool destroy hazz0
sudo reboot
Now OpenSolaris is not booting; everything has vanished.
Is there any way to restore everything?
-- 
This message posted from opensolaris.org


Re: [zfs-discuss] backup disk of rpool on solaris

2009-09-20 Thread Frank Middleton

On 09/20/09 03:20 AM, dick hoogendijk wrote:

On Sat, 2009-09-19 at 22:03 -0400, Jeremy Kister wrote:

I added a disk to the rpool of my zfs root:
# zpool attach rpool c1t0d0s0 c1t1d0s0
# installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c1t1d0s0

I waited for the resilver to complete, then i shut the system down.

then i physically removed c1t0d0 and put c1t1d0 in it's place.

I tried to boot the system, but it panics:


Afaik you can't remove the first disk. You've created a mirror of two
disks from either which you may boot the system. BUT the second disk
must remain where it is. You can set the bios to boot from it if the
first disk fails, but you may not *swap* them.


That's my experience also. If you are trying to make a bootable
disk to keep on the shelf, there's an excellent example here:
http://forums.sun.com/thread.jspa?threadID=5345546

IMO this should go on the wiki. I think it's a great example of
the power of ZFS. I can't imagine doing anything like this
so easily with any legacy file system...

Cheers -- Frank



Re: [zfs-discuss] Persistent errors - do I believe?

2009-09-20 Thread Chris Murray
Ok, the resilver has been restarted a number of times over the past few days 
due to two main issues - a drive disconnecting itself, and power failure. I 
think my troubles are 100% down to these environmental factors, but would like 
some confidence that after the resilver has completed, if it reports there 
aren't any persistent errors, that there actually aren't any.

Attempt #1: the resilver started after I initiated the replace on my SXCE105 
install. All was well until the box lost power. On starting back up, it hung 
while starting OpenSolaris - just after the line containing the system 
hostname. I've had this before when a scrub is in progress. My usual tactic is 
to boot with the 2009.06 live CD, import the pool, stop the scrub, export, 
reboot into SXCE105 again, and import. Of course, you can't stop a replace 
that's in progress, so the remaining attempts are in the 2009.06 live CD (build 
111b perhaps?)

Attempt #2: the resilver started on importing the pool in 2009.06. It was 
resilvering fine until one drive reported itself as offline. dmesg showed that 
the drive was 'gone'. I then noticed a lot of checksum errors at the pool 
level, and RAIDZ1 level, and a large number of 'permanent' errors. In a panic, 
thinking that the resilver was now doing more harm than good, I exported the 
pool and rebooted.

Attempt #3: I imported in 2009.06 again. This time, the drive that was 
disconnected last attempt was online again, and proceeded to resilver along 
with the original drive. There was only one permanent error - in a particular 
snapshot of a ZVOL I'm not too concerned about. This is the point at which I wrote 
the original post, wondering if all of those 700+ errors reported the first 
time around weren't a problem any more. I have been running zpool clear in a 
loop because there were checksum errors on another of the drives (neither of 
the two part of the replacing vdev, and not the one that was removed 
previously). I didn't want it to be marked as faulty, so I kept the zpool clear 
running. Then .. power failure.

Attempt #4: I imported in 2009.06. This time, no errors detected at all. Is 
that a result of my zpool clear? Would that clear any 'permanent' errors? From 
the wording, I'd say it wouldn't, and therefore the action of starting the 
resilver again with all of the correct disks in place hasn't found any errors 
so far ... ? Then, disk removal again ... :-(

Attempt #5: I'm convinced that drive removal is down to faulty cabling. I move 
the machine, completely disconnect all drives, re-wire all connections with new 
cables, and start the scrub again in 2009.06. Now, there are checksum errors 
again, so I'm running zpool clear in order to keep drives from being marked as 
faulted .. but I also have this:

errors: Permanent errors have been detected in the following files:
zp/iscsi/meerkat_t...@20090905_1631:<0x1>

I have a few of my usual VMs powered up (ESXi connecting using NFS), and they 
appear to be fine. I've run a chkdsk in the Windows VMs, and no errors are 
reported. Although I can't be 100% confident that any of those files were in 
the original list of 700+ errors. In the absence of iscsitgtd, I'm not powering 
up the ones that rely on iSCSI just yet.

My next steps will be:
1. allow the resilver to finish. Assuming I don't have yet another power cut, 
this will be in about 24 hours.
2. zpool export
3. reboot into SXCE
4. zpool import
5. start all my usual virtual machines on the ESXi host
6. note whether that permanent error is still there <-- this will be an 
interesting one for me - will the export & import clear the error? will my 
looped zpool clear have simply reset the checksum counters to zero, or will it 
have cleared this too?
7. zpool scrub to see what else turns up.
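
For steps 6 and 7 that boils down to:

zpool status -v zp      # is the snapshot still listed under "errors:"?
zpool scrub zp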

Chris
-- 
This message posted from opensolaris.org


[zfs-discuss] ZFS Recv slow with high CPU

2009-09-20 Thread Tristan Ball




Hi Everyone,

I have a couple of systems running opensolaris b118, one of which sends
hourly snapshots to the other. This has been working well, however as
of today, the receiving zfs process has started running extremely
slowly, and is running at 100% CPU on one core, completely in kernel
mode. A little bit of exploration with lockstat and dtrace seems to
imply that the issue is around the "dbuf_free_range" function - or at
least, that's what it looks like to my inexperienced eye!

The system is very unresponsive while this problem is occurring, with
frequent multi-second delays between my typing into an ssh session and
getting a response. 

Has anyone seen this before, or know if there's anything I can do about
it?

Lockstat says:

Adaptive mutex hold: 52299 events in 1.694 seconds (30873 events/sec)

---
Count indv cuml rcnt nsec Lock   Caller
   50  87%  87% 0.00 23410134 0xff0273d6dcb8 dbuf_free_range+0x268

  nsec -- Time Distribution -- count
  33554432 |@@ 50
---
Count indv cuml rcnt nsec Lock   Caller
   53   0%  87% 0.00   117124 0xff025f404140 txg_rele_to_quiesce+0x18

  nsec -- Time Distribution -- count
    131072 |@@ 53
---
Count indv cuml rcnt nsec Lock   Caller
 2536   0%  88% 0.00 2194 0xff024f09bd40 kmem_cache_alloc+0x84

  nsec -- Time Distribution -- count
  2048 |@  1121
  4096 |   1383
  8192 |   0
 16384 |   32
---
Count indv cuml rcnt nsec Lock   Caller
  415   0%  88% 0.00 7968 0xff025120ce00 anon_resvmem+0xb4

[snip]

I modified the "procsystime" script from the dtrace toolkit to trace
fbt:zfs::entry/return rather than syscalls, and I get an output like
this (approximately a 5sec sample):
Function                        nsecs
l2arc_write_eligible   46155456
spa_last_synced_txg   49211711
zio_inherit_child_errors   50039132
  zil_commit   50137305
    buf_hash   50839858
   dbuf_rele   56388970
zio_wait_for_children   67533094
dbuf_update_data   70302317
dsl_dir_space_towrite   72179496
    parent_delta   77685327
   dbuf_hash   79075532
free_range_compar   82479196
 dbuf_free_range 2715449416
  TOTAL: 4156078240
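
For reference, roughly the same per-function timing can be had with an ad-hoc one-liner along these lines:

dtrace -n 'fbt:zfs::entry { self->ts = timestamp; }
    fbt:zfs::return /self->ts/ { @[probefunc] = sum(timestamp - self->ts); self->ts = 0; }'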

Thanks,
    Tristan







Re: [zfs-discuss] backup disk of rpool on solaris

2009-09-20 Thread dick hoogendijk
On Sat, 2009-09-19 at 22:03 -0400, Jeremy Kister wrote:
> I added a disk to the rpool of my zfs root:
> # zpool attach rpool c1t0d0s0 c1t1d0s0
> # installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c1t1d0s0
> 
> I waited for the resilver to complete, then i shut the system down.
> 
> then i physically removed c1t0d0 and put c1t1d0 in it's place.
> 
> I tried to boot the system, but it panics:

Afaik you can't remove the first disk. You've created a mirror of two
disks from either which you may boot the system. BUT the second disk
must remain where it is. You can set the bios to boot from it if the
first disk fails, but you may not *swap* them.

-- 
Dick Hoogendijk -- PGP/GnuPG key: 01D2433D
+ http://nagual.nl/ | SunOS 10u7 5/09 | OpenSolaris 2010.02 b123
+ All that's really worth doing is what we do for others (Lewis Carrol)
