Re: [zfs-discuss] any more efficient way to transfer snapshot between two hosts than ssh tunnel?

2012-12-13 Thread Adrian Smith
Hi Fred,

Try mbuffer (http://www.maier-komor.de/mbuffer.html)
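
For example, a minimal sketch (host name, port, dataset names and buffer sizes
are all illustrative): start the receiver first, then pipe the send stream
through mbuffer on the sender:

  receiver# mbuffer -I 9090 -s 128k -m 1G | zfs receive tank/backup
  sender#   zfs send tank/fs@snap | mbuffer -O receiver:9090 -s 128k -m 1G

-s is the block size mbuffer works in and -m is how much RAM it may use for
buffering; tune both to taste.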


On 14 December 2012 15:01, Fred Liu fred_...@issi.com wrote:

  Assuming a secure and trusted environment, we want to get the maximum
 transfer speed without the overhead of ssh.


 Thanks.


 Fred

 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss




-- 
Adrian Smith (ISUnix), Ext: 55070
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] grrr, How to get rid of mis-touched file named `-c'

2011-11-28 Thread Smith, David W.

You could list by inode, then use find with rm.

# ls -i
7223 -O

# find . -inum 7223 -exec rm {} \;
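
(Illustrative only: if you want to double-check the match before deleting,
run the same find with ls first:

  # find . -inum 7223 -exec ls -l {} \;

and then re-run it with rm once you are sure it is the right file.)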

David

On 11/23/11 2:00 PM, Jason King (Gmail) jason.brian.k...@gmail.com
wrote:

 Did you try rm -- filename ?
 
 Sent from my iPhone
 
 On Nov 23, 2011, at 1:43 PM, Harry Putnam rea...@newsguy.com wrote:
 
 Somehow I touched some rather peculiar file names in ~.  Experimenting
 with something I've now forgotten I guess.
 
 Anyway I now have 3 zero length files with names -O, -c, -k.
 
 I've tried as many styles of escaping as I could come up with but all
 are rejected like this:
 
  rm \-c 
  rm: illegal option -- c
  usage: rm [-fiRr] file ...
 
 Ditto for:
 
  [\-]c
  '-c'
  *c
  '-'c
 \075c
 
 OK, I'm out of escapes and other tricks... other than using emacs, but
 I haven't installed emacs yet.
 
 I can just ignore them of course, until such time as I do get emacs
 installed, but by now I just want to know how it might be done from a
 shell prompt.
 
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How to recover -- LUNs go offline, now permanent errors?

2011-07-18 Thread David Smith
Cindy,

I gave your suggestion a try.  I did the zpool clear and then did another zpool
scrub, and all is happy now.  Thank you for your help.

David
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How to recover -- LUNs go offline, now permanent errors?

2011-07-15 Thread David Smith
Cindy,

Thanks for the reply.  I'll give that a try and then send an update.

Thanks,

David
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] How to recover -- LUNs go offline, now permanent errors?

2011-07-13 Thread David Smith
I recently had an issue with the LUNs from our storage unit going offline.  This
caused the zpool to get numerous errors on the LUNs.  The pool is online, and
I did a scrub, but one of the raid sets is degraded:

    raidz2-3                             DEGRADED 0 0 0
      c7t60001FF011C6F3103B00011D1BF1d0  DEGRADED 0 0 0  too many errors
      c7t60001FF011C6F3023900011D1BF1d0  DEGRADED 0 0 0  too many errors
      c7t60001FF011C6F2F53700011D1BF1d0  DEGRADED 0 0 0  too many errors
      c7t60001FF011C6F2E43500011D1BF1d0  DEGRADED 0 0 0  too many errors
      c7t60001FF011C6F2D23300011D1BF1d0  DEGRADED 0 0 0  too many errors
      c7t60001FF011C6F2A93100011D1BF1d0  DEGRADED 0 0 0  too many errors
      c7t60001FF011C6F29A2F00011D1BF1d0  DEGRADED 0 0 0  too many errors
      c7t60001FF011C6F2682D00011D1BF1d0  DEGRADED 0 0 0  too many errors
      c7t60001FF011C6F24C2B00011D1BF1d0  DEGRADED 0 0 0  too many errors
      c7t60001FF011C6F2192900011D1BF1d0  DEGRADED 0 0 0  too many errors

Also I have the following:
errors: Permanent errors have been detected in the following files:

        <0x3a>:<0x3b04>

Originally, there was a file and then a directory listed, but I removed them.
Now I'm stuck with the hex codes above.  How do I interpret them?  Can this
pool be recovered, or basically how do I proceed?

The system is Solaris 10 U9 with all recent patches.

Thanks,

David
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Zpool metadata corruption from S10U9 to S11 express

2011-06-23 Thread Smith, David W.


On 6/22/11 10:28 PM, Fajar A. Nugraha w...@fajar.net wrote:

 On Thu, Jun 23, 2011 at 9:28 AM, David W. Smith smith...@llnl.gov wrote:
 When I tried out Solaris 11, I just exported the pool prior to the install of
 Solaris 11.  I was lucky in that I had mirrored the boot drive, so after I had
 installed Solaris 11 I still had the other disk in the mirror with Solaris 10
 still installed.

 I didn't install any additional software in either environment with regard to
 volume management, etc.

 From the format command, I did remember seeing 60 luns coming from the DDN and,
 as I recall, I did see multiple paths as well under Solaris 11.  I think you
 are correct, however, in that for some reason Solaris 11 could not read the
 devices.
 
 
 So you mean the root cause of the problem is Solaris Express failed to
 see the disks? Or are the disks available on solaris express as well?
 
 When you boot with Solaris Express Live CD, what does zpool import show?


Under Solaris 11 Express, disks were seen with the format command, or with
luxadm probe, etc.  So I'm not sure why zpool import failed, or why, as I
assume, it could not read the devices.  I have not tried the Solaris Express
live CD; I was booted off an installed version.

David

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Zpool metadata corruption from S10U9 to S11 express

2011-06-23 Thread David W. Smith
path='/dev/dsk/c3t59d0s0'

devid='id1,sd@TATA_STEC_ZeusIOPS___018_GBytes__STMD039A/a'
phys_path='/pci@7c,0/pci10de,376@e/pci1000,3150@0/sd@3b,0:a'
whole_disk=1
create_txg=269718
children[1]
type='disk'
id=1
guid=2456972971894251597
path='/dev/dsk/c3t60d0s0'

devid='id1,sd@TATA_STEC_ZeusIOPS___018_GBytes__STMCFFC0/a'
phys_path='/pci@7c,0/pci10de,376@e/pci1000,3150@0/sd@3c,0:a'
whole_disk=1
create_txg=269718
rewind_txg_ts=1308690257
bad config type 7 for seconds_of_rewind
verify_data_errors=0

LABEL 3

version=22
name='tank'
state=0
txg=402415
pool_guid=13155614069147461689
hostid=799263814
hostname='Chaiten'
top_guid=7929625263716612584
guid=12265708552998034011
vdev_children=8
vdev_tree
type='mirror'
id=7
guid=7929625263716612584
metaslab_array=171
metaslab_shift=27
ashift=9
asize=18240241664
is_log=1
create_txg=269718
children[0]
type='disk'
id=0
guid=12265708552998034011
path='/dev/dsk/c3t59d0s0'

devid='id1,sd@TATA_STEC_ZeusIOPS___018_GBytes__STMD039A/a'
phys_path='/pci@7c,0/pci10de,376@e/pci1000,3150@0/sd@3b,0:a'
whole_disk=1
create_txg=269718
children[1]
type='disk'
id=1
guid=2456972971894251597
path='/dev/dsk/c3t60d0s0'

devid='id1,sd@TATA_STEC_ZeusIOPS___018_GBytes__STMCFFC0/a'
phys_path='/pci@7c,0/pci10de,376@e/pci1000,3150@0/sd@3c,0:a'
whole_disk=1
create_txg=269718
rewind_txg_ts=1308690257
bad config type 7 for seconds_of_rewind
verify_data_errors=0



Please let me know if you need more info...

Thanks,


David W. Smith

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Zpool metadata corruption from S10U9 to S11 express

2011-06-22 Thread David W. Smith

I was recently running Solaris 10 U9 and I decided that I would like to go
to Solaris 11 Express so I exported my zpool, hoping that I would just do
an import once I had the new system installed with Solaris 11.  Now when I
try to do an import I'm getting the following:

# /home/dws# zpool import
  pool: tank
id: 13155614069147461689
 state: FAULTED
status: The pool metadata is corrupted.
action: The pool cannot be imported due to damaged devices or data.
   see: http://www.sun.com/msg/ZFS-8000-72
config:

tank FAULTED  corrupted data
logs
  mirror-6   ONLINE
c9t57d0  ONLINE
c9t58d0  ONLINE
  mirror-7   ONLINE
c9t59d0  ONLINE
c9t60d0  ONLINE

Is there something else I can do to see what is wrong?

Original attempt when specifying the name resulted in:

# /home/dws# zpool import tank
cannot import 'tank': I/O error
Destroy and re-create the pool from
a backup source.

I verified that I have all 60 of my luns.  The controller numbers have
changed, but I don't believe that should matter.

Any suggestions about getting additional information about what is happening 
would be greatly appreciated.

Thanks,

David

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Zpool metadata corruption from S10U9 to S11 express

2011-06-22 Thread David Smith
An update:

I had mirrored my boot drive when I installed Solaris 10U9 originally, so I 
went ahead and rebooted the system to this disk instead of my Solaris 11 
install.  After getting the system up, I imported the zpool, and everything 
worked normally.  

So I guess there is some sort of incompatibility between Solaris 10 and Solaris 
11.  I would have thought that Solaris 11 could import an older pool level.

Any other insight on importing pools between these two versions of Solaris 
would be helpful.

Thanks,

David
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Zpool metadata corruption from S10U9 to S11 express

2011-06-22 Thread David W. Smith
On Wed, Jun 22, 2011 at 06:32:49PM -0700, Daniel Carosone wrote:
 On Wed, Jun 22, 2011 at 12:49:27PM -0700, David W. Smith wrote:
  # /home/dws# zpool import
pool: tank
  id: 13155614069147461689
   state: FAULTED
  status: The pool metadata is corrupted.
  action: The pool cannot be imported due to damaged devices or data.
 see: http://www.sun.com/msg/ZFS-8000-72
  config:
  
  tank FAULTED  corrupted data
  logs
mirror-6   ONLINE
  c9t57d0  ONLINE
  c9t58d0  ONLINE
mirror-7   ONLINE
  c9t59d0  ONLINE
  c9t60d0  ONLINE
  
  Is there something else I can do to see what is wrong.
 
 Can you tell us more about the setup, in particular the drivers and
 hardware on the path?  There may be labelling, block size, offset or
 even bad drivers or other issues getting in the way, preventing ZFS
 from doing what should otherwise be expected to work.   Was there
 something else in the storage stack on the old OS, like a different
 volume manager or some multipathing?
 
 Can you show us the zfs labels with zdb -l /dev/foo ?
 
 Does import -F get any further?
 
  Original attempt when specifying the name resulted in:
  
  # /home/dws# zpool import tank
  cannot import 'tank': I/O error
 
 Some kind of underlying driver problem odour here.
 
 --
 Dan.

The system is an x4440 with two dual-port Qlogic 8 Gbit FC cards connected to a
DDN 9900 storage unit.  There are 60 luns configured from the storage unit; we are
using raidz1 across these luns in a 9+1 configuration.  Under Solaris 10U9,
multipathing is enabled.

For example here is one of the devices:


# luxadm display /dev/rdsk/c8t60001FF010DC50AA2E00081D1BF1d0s2
DEVICE PROPERTIES for disk: /dev/rdsk/c8t60001FF010DC50AA2E00081D1BF1d0s2
  Vendor:   DDN 
  Product ID:   S2A 9900
  Revision: 6.11
  Serial Num:   10DC50AA002E
  Unformatted capacity: 15261576.000 MBytes
  Write Cache:  Enabled
  Read Cache:   Enabled
Minimum prefetch:   0x0
Maximum prefetch:   0x0
  Device Type:  Disk device
  Path(s):

  /dev/rdsk/c8t60001FF010DC50AA2E00081D1BF1d0s2
  /devices/scsi_vhci/disk@g60001ff010dc50aa2e00081d1bf1:c,raw
   Controller   /dev/cfg/c5
Device Address  2401ff051232,2e
Host controller port WWN2101001b32bfe1d3
Class   secondary
State   ONLINE
   Controller   /dev/cfg/c7
Device Address  2801ff0510dc,2e
Host controller port WWN2101001b32bd4f8f
Class   primary
State   ONLINE


Here is the output of the zdb command:

# zdb -l /dev/dsk/c8t60001FF010DC50AA2E00081D1BF1d0s0

LABEL 0

version=22
name='tank'
state=0
txg=402415
pool_guid=13155614069147461689
hostid=799263814
hostname='Chaiten'
top_guid=7879214599529115091
guid=9439709931602673823
vdev_children=8
vdev_tree
type='raidz'
id=5
guid=7879214599529115091
nparity=1
metaslab_array=35
metaslab_shift=40
ashift=12
asize=160028491776000
is_log=0
create_txg=22
children[0]
type='disk'
id=0
guid=15738823520260019536
path='/dev/dsk/c8t60001FF0123252803700081D1BF1d0s0'
devid='id1,sd@n60001ff0123252803700081d1bf1/a'
phys_path='/scsi_vhci/disk@g60001ff0123252803700081d1bf1:a'
whole_disk=1
DTL=166
create_txg=22
children[1]
type='disk'
id=1
guid=7241121769141495862
path='/dev/dsk/c8t60001FF010DC50C53600081D1BF1d0s0'
devid='id1,sd@n60001ff010dc50c53600081d1bf1/a'
phys_path='/scsi_vhci/disk@g60001ff010dc50c53600081d1bf1:a'
whole_disk=1
DTL=165
create_txg=22
children[2]
type='disk'
id=2
guid=2777230007222012140
path='/dev/dsk/c8t60001FF0123252793500081D1BF1d0s0'
devid='id1,sd@n60001ff0123252793500081d1bf1/a'
phys_path='/scsi_vhci/disk@g60001ff0123252793500081d1bf1:a'
whole_disk=1
DTL=164
create_txg=22
children[3]
type='disk'
id=3
guid=5525323314985659974
path='/dev/dsk/c8t60001FF010DC50BE3400081D1BF1d0s0'
devid='id1,sd@n60001ff010dc50be3400081d1bf1/a'
phys_path='/scsi_vhci/disk@g60001ff010dc50be3400081d1bf1:a'
whole_disk=1
DTL=163

Re: [zfs-discuss] Question on ZFS iSCSI

2011-06-01 Thread a . smith

Disk /dev/zvol/rdsk/pool/dcpool: 4295GB
Sector size (logical/physical): 512B/512B



Just to check, did you already try:

zpool import -d /dev/zvol/rdsk/pool/ poolname

?

thanks Andy.




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS, Oracle and Nexenta

2011-05-25 Thread a . smith

 Still I wonder what Gartner means by Oracle monetizing on ZFS..


It simply means that Oracle wants to make money from ZFS (as is normal
for technology companies with their own technology). The reason this
might cause uncertainty for ZFS is that maintaining, or helping to make,
the open source version of ZFS better may be seen by Oracle as
contradictory to making money from it.
That said, what is already open source cannot be un-open-sourced, as
others have said...


cheers Andy.



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Monitoring disk seeks

2011-05-24 Thread a . smith

Hi,

  see the seeksize script at this URL:

http://prefetch.net/articles/solaris.dtracetopten.html

Not used it but looks neat!

cheers Andy.



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Solaris vs FreeBSD question

2011-05-18 Thread a . smith

Hi,

  I am using FreeBSD 8.2 in production with ZFS. Although I have had
one issue with it in the past, I would recommend it and I consider
it production ready. That said, if you can wait for FreeBSD 8.3 or 9.0
to come out (a few months away) you will get a better system, as these
will include ZFS v28 (FreeBSD-RELEASE is currently v15).
On the other hand, things can always go wrong; of course RAID is not
backup, even with snapshots ;)


cheers Andy.



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS send/recv initial data load

2011-02-16 Thread a . smith

On Feb 16, 2011, at 7:38 AM, whitetr6 at gmail.com wrote:

My question is about the initial seed of the data. Is it possible  
to use a portable drive to copy the initial zfs filesystem(s) to the  
remote location and then make the subsequent incrementals over the  
network? If so, what would I need to do to make sure it is an exact  
copy? Thank you,


Yes, you can send the initial seed snapshot to a file on a portable
disk.  For example:


 # zfs send tank/volume@seed > /myexternaldrive/zfssnap.data

If the volume of data is too much to fit on a single disk then you can
create a new pool spread across the number of disks you require and make
a duplicate of the snapshot onto your new pool. Then, from the new pool,
you can run a new zfs send once connected to your offsite server.
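
A rough sketch of the rest of the cycle (pool, dataset and host names are
placeholders): restore the seed at the remote site from the file, then send
later snapshots incrementally over the network against that seed:

  remote# zfs receive backup/volume < /myexternaldrive/zfssnap.data
  local#  zfs send -i @seed tank/volume@today | ssh remotehost zfs receive backup/volume

As long as the receiving side keeps the @seed snapshot (and the received
dataset is left unmodified, or you use receive -F), the incremental stream
applies cleanly.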


thanks Andy.



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Drive i/o anomaly

2011-02-08 Thread a . smith

It is a 4k sector drive, but I thought zfs recognised those drives and didn't
need any special configuration...?


4k drives are a big problem for ZFS; much has been posted/written
about it. Basically, if the 4k drives report 512 byte blocks, as they
almost all do, then ZFS does not detect and configure the pool
correctly. If the drive actually reports the real 4k block size, ZFS
handles this very nicely.
So the problem/fault is drives misreporting the real block size, to
maintain compatibility with other OSes etc, and not really with ZFS.
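
If in doubt about what an existing pool ended up with, the ashift of its
vdevs tells you.  A quick check (pool name is just a placeholder; ashift=9
means a 512-byte layout, ashift=12 a 4k one):

  # zdb -C tank | grep ashift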


cheers Andy.



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool scalability and performance

2011-01-13 Thread a . smith
Basically, I think yes, you need to add all the vdevs you require in the
circumstances you describe.


You just have to consider what ZFS is able to do with the disks that
you give it. If you have 4x mirrors to start with, then all writes will
be spread across all disks and you will get nice performance using all
8 spindles/disks. If you fill all of these up and then add one other
mirror, it is logical that new data will only be written to the free
space on the new mirror, and you will get the performance of writing
data to a single mirrored vdev.


To handle this you would either have to add sufficient new devices to
give you your required performance, or, if there is a fair amount of
data turnaround on your pool (i.e. you are deleting old data, including
from snapshots), you might get reasonable performance by adding a new
mirror at some point before your existing pool is completely full. Data
will initially get written and spread across all disks, as there will be
free space on all disks, and over time old data will be removed from the
older vdevs. That would result in reads and writes benefiting from all
vdevs most of the time, but it's not going to give you guarantees of
that, I guess...
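
To make that concrete, a minimal sketch (pool and device names are
placeholders only): add the extra mirror and then watch how the writes are
spread per vdev:

  # zpool add tank mirror c1t5d0 c1t6d0
  # zpool iostat -v tank 5

zpool iostat -v breaks the figures down per vdev, so you can see whether new
writes really are landing mostly on the emptier mirror.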


Anyway, that's what occurred to me on the subject! ;)

cheers Andy.



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS percent busy vs zpool iostat

2011-01-12 Thread a . smith

Quoting Bob Friesenhahn bfrie...@simple.dallas.tx.us:



What function is the system performing when it is so busy?


The workload of the server is an SMTP mail server, with associated spam
and virus scanning, and serving maildir email via POP3 and IMAP.




Wrong conclusion.  I am not sure what the percentages are  
percentages of (total RAM?), but 603MB is a very small ARC.  FreeBSD  
pre-assigns kernel memory for zfs so it is not dynamically shared  
with the kernel as it is with Solaris.


This is the min, max, and actual size of the ARC. ZFS is free to use
up to the max (2098.08M) if it decides it wants to. Depending on the
workload on this server it will go up to 2098M (as in, I've seen it get
to that size on this and other servers); just with its usual daily
workload it decides to settle at around 600M. I assume it decides
it's not worth using any more RAM.


The ARC is adaptive so you should not assume that its objective is  
to try to absorb your hard drive.  It should not want to cache data  
which is rarely accessed.  Regardless,  your ARC size may actually  
be constrained by default FreeBSD kernel tunings.


I guess then that ZFS is weighing up how useful it is to use more than
600M and deciding that it isn't that useful? Anyway, I've just
forced the min to 1900M so will see how this goes today.
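
For reference, on FreeBSD the ARC bounds are loader tunables, so forcing the
minimum is just (values here only mirror the example above; they take effect
at the next boot):

  # /boot/loader.conf
  vfs.zfs.arc_min="1900M"
  vfs.zfs.arc_max="2098M"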




The type of drives you are using have very poor seek performance.  
Higher RPM drives would surely help.  Stuffing lots more memory in  
your system and adjusting the kernel so that zfs can use a lot more  
of it is likely to help dramatically.  Zfs loves memory.


thanks Bob, and also to Matt for your comments...



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS percent busy vs zpool iostat

2011-01-12 Thread a . smith
Ok, I think I have found the biggest issue. The drives are 4k sector drives,
and I wasn't aware of that. My fault, I should have checked this. I'd had
the disks for ages and they are sub-1TB, so I had the idea that they wouldn't
be 4k drives...


I will obviously have to address this, either by creating a pool using
4k-aware zfs commands or by replacing the disks.
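
For anyone else hitting this on FreeBSD, the usual workaround at pool
creation time (it only helps for a newly created pool; device and pool names
below are placeholders) is to build the pool on a 4k gnop shim so the vdev
gets ashift=12, then re-import it on the raw device:

  # gnop create -S 4096 /dev/ada0
  # zpool create tank /dev/ada0.nop
  # zpool export tank
  # gnop destroy /dev/ada0.nop
  # zpool import tank
  # zdb -C tank | grep ashift

The pool keeps the 4k alignment after the export/import.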


Anyway, thanks to all and to Taemun for getting me to check this...



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] System crash on zpool attach object_count == usedobjs failed assertion

2010-03-03 Thread Nigel Smith
I've just run zdb against the two pools on my home OpenSolaris box,
and now both are showing this failed assertion, with the counts off by one.

  # zdb rpool > /dev/null
  Assertion failed: object_count == usedobjs (0x18da2 == 0x18da3), file 
../zdb.c, line 1460
  Abort (core dumped)

  # zdb rz2pool > /dev/null
  Assertion failed: object_count == usedobjs (0x2ba25 == 0x2ba26), file 
../zdb.c, line 1460
  Abort (core dumped)

The last time I checked them with zdb, probably a few months back,
they were fine.

And since the pools otherwise seem to be behaving without problem,
I've had no reason to run zdb.

'zpool status' looks fine, and the pools mount without problem.
'zpool scrub' works without problem.

I have been upgrading to most of the recent 'dev' version of OpenSolaris.
I wonder if there is some bug in the code that could cause this assertion.

Maybe one unusual thing is that I have not yet upgraded the
versions of the pools.

  # uname -a
  SunOS opensolaris 5.11 snv_133 i86pc i386 i86pc  
  # zpool upgrade
  This system is currently running ZFS pool version 22.

  The following pools are out of date, and can be upgraded.  After being
  upgraded, these pools will no longer be accessible by older software versions.

  VER  POOL
  ---  
  13   rpool
  16   rz2pool

The assertion is being tracked by this bug:

  http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6801840

..but in that report, the counts are not off by one.
Unfortunately, there is little indication of any progress being made.

Maybe some other 'zfs-discuss' readers could try zdb on their pools,
if using a recent dev build, and see if they get a similar problem...

Thanks
Nigel Smith


# mdb core
Loading modules: [ libumem.so.1 libc.so.1 libzpool.so.1 libtopo.so.1 
libavl.so.1 libnvpair.so.1 ld.so.1 ]
> ::status
debugging core file of zdb (64-bit) from opensolaris
file: /usr/sbin/amd64/zdb
initial argv: zdb rpool
threading model: native threads
status: process terminated by SIGABRT (Abort), pid=883 uid=0 code=-1
panic message:
Assertion failed: object_count == usedobjs (0x18da2 == 0x18da3), file ../zdb.c,
line 1460
> $C
fd7fffdff090 libc.so.1`_lwp_kill+0xa()
fd7fffdff0b0 libc.so.1`raise+0x19()
fd7fffdff0f0 libc.so.1`abort+0xd9()
fd7fffdff320 libc.so.1`_assert+0x7d()
fd7fffdff810 dump_dir+0x35a()
fd7fffdff840 dump_one_dir+0x54()
fd7fffdff850 libzpool.so.1`findfunc+0xf()
fd7fffdff940 libzpool.so.1`dmu_objset_find_spa+0x39f()
fd7fffdffa30 libzpool.so.1`dmu_objset_find_spa+0x1d2()
fd7fffdffb20 libzpool.so.1`dmu_objset_find_spa+0x1d2()
fd7fffdffb40 libzpool.so.1`dmu_objset_find+0x2c()
fd7fffdffb70 dump_zpool+0x197()
fd7fffdffc10 main+0xa3d()
fd7fffdffc20 0x406e6c()
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] System crash on zpool attach object_count == usedobjs failed assertion

2010-03-03 Thread Nigel Smith
Hi Stephen 

If your system is crashing while attaching the new device,
are you getting a core dump file?

If so, it would be interesting to examine the file with mdb,
to see the stack backtrace, as this may give a clue to what's going wrong.

What storage controller are you using for the disks?
And what device driver is the controller using?

Thanks
Nigel Smith
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] crashed zpool

2010-03-01 Thread Nigel Smith
Hello Carsten

Have you examined the core dump file with mdb ::stack
to see if this gives a clue to what happened?

Regards
Nigel
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Help with itadm commands

2010-02-23 Thread Nigel Smith
The iSCSI COMSTAR Port Provider is not installed by default.
What release of OpenSolaris are you running?
If pre snv_133 then:

  $ pfexec pkg install  SUNWiscsit

For snv_133, I think it will be:

  $ pfexec pkg install  network/iscsi/target

Regards
Nigel Smith
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Abysmal ISCSI / ZFS Performance

2010-02-18 Thread Nigel Smith
Hi Matt
Are you seeing low speeds on writes only, or on both read AND write?

Are you seeing low speed just with iSCSI or also with NFS or CIFS?

 I've tried updating to COMSTAR 
 (although I'm not certain that I'm actually using it)

To check, do this:

  # svcs -a | grep iscsi

If 'svc:/system/iscsitgt:default' is online,
you are using the old and mature 'user mode' iscsi target.

If 'svc:/network/iscsi/target:default' is online,
then you are using the new 'kernel mode' comstar iscsi target.

For another good way to monitor disk i/o, try:

  # iostat -xndz 1

  http://docs.sun.com/app/docs/doc/819-2240/iostat-1m?a=view

Don't just assume that your Ethernet, IP and TCP layers
are performing to the optimum - check it.

I often use 'iperf' or 'netperf' to do this:

  http://blogs.sun.com/observatory/entry/netperf

(Iperf is available by installing the SUNWiperf package.
A package for netperf is in the contrib repository.)

The last time I checked, the default values used
in the OpenSolaris TCP stack are not optimal
for Gigabit speed, and need to be adjusted.
Here is some advice I found with Google, but
there are others:

  
http://serverfault.com/questions/13190/what-are-good-speeds-for-iscsi-and-nfs-over-1gb-ethernet
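
For example, something along these lines (the values are illustrative only;
pick what suits your RAM and latency, and note that ndd settings do not
survive a reboot):

  # ndd -set /dev/tcp tcp_xmit_hiwat 1048576
  # ndd -set /dev/tcp tcp_recv_hiwat 1048576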

BTW, what sort of network card are you using,
as this can make a difference.

Regards
Nigel Smith
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Abysmal ISCSI / ZFS Performance

2010-02-18 Thread Nigel Smith
Hi Matt

 Haven't gotten NFS or CIFS to work properly.
 Maybe I'm just too dumb to figure it out,
 but I'm ending up with permissions errors that don't let me do much.
 All testing so far has been with iSCSI.

So until you can test NFS or CIFS, we don't know if it's a 
general performance problem, or just an iSCSI problem.

To get CIFS working, try this:

  
http://blogs.sun.com/observatory/entry/accessing_opensolaris_shares_from_windows

 Here's IOStat while doing writes : 
 Here's IOStat when doing reads : 

You're getting 1000+ kr/s and kw/s, so add the iostat 'M' option
to display throughput in MegaBytes per second.

 It'll sustain 10-12% gigabit for a few minutes, have a little dip,

I'd still be interested to see the size of the TCP buffers.
What does this report:

# ndd /dev/tcp  tcp_xmit_hiwat
# ndd /dev/tcp  tcp_recv_hiwat
# ndd /dev/tcp  tcp_conn_req_max_q
# ndd /dev/tcp  tcp_conn_req_max_q0

 Current NIC is an integrated NIC on an Abit Fatality motherboard.
 Just your generic fare gigabit network card.
 I can't imagine that it would be holding me back that much though.

Well there are sometimes bugs in the device drivers:

  http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6913756
  http://sigtar.com/2009/02/12/opensolaris-rtl81118168b-issues/

That's why I say don't just assume the network is performing to the optimum.

To do a local test, direct to the hard drives, you could try 'dd',
with various transfer sizes. Some advice from BenR, here:

  http://www.cuddletech.com/blog/pivot/entry.php?id=820
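
For example (file name and sizes are illustrative; use a file larger than RAM
to keep the ARC out of the picture, and note that /dev/zero is a poor test if
compression is enabled on the dataset):

  # dd if=/dev/zero of=/tank/ddtest bs=1024k count=8192
  # dd if=/tank/ddtest of=/dev/null bs=1024k

Then compare the local MB/s with what you see over iSCSI.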

Regards
Nigel Smith
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Abysmal ISCSI / ZFS Performance

2010-02-18 Thread Nigel Smith
Another thing you could check, which has been reported to
cause a problem, is if network or disk drivers share an interrupt
with a slow device, like say a USB device. So try:

# echo ::interrupts -d | mdb -k

... and look for multiple driver names on an INT#.
Regards
Nigel Smith
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Idiots Guide to Running a NAS with ZFS/OpenSolaris

2010-02-18 Thread Nigel Smith
Hi Robert 
Have a look at these links:

  http://delicious.com/nwsmith/opensolaris-nas

Regards
Nigel Smith
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Disk Issues

2010-02-16 Thread Nigel Smith
I have booted up an osol-dev-131 live CD on a Dell Precision T7500,
and the AHCI driver successfully loaded, to give access
to the two sata DVD drives in the machine.

(Unfortunately, I did not have the opportunity to attach
any hard drives, but I would expect that also to work.)

'scanpci' identified the southbridge as an
Intel 82801JI (ICH10 family)
Vendor 0x8086, device 0x3a22

AFAIK, as long as the SATA interface reports a PCI ID
class-code of 010601, then the AHCI device driver
should load.

The mode of the SATA interface will need to be selected in the BIOS.
There are normally three modes: Native IDE, RAID or AHCI.

'scanpci' should report different class-codes depending
on the mode selected in the BIOS.

RAID mode should report a class-code of 010400
IDE mode should report a class-code of 0101xx

With OpenSolaris, you can see the class-code in the
output from 'prtconf -pv'.
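
For example:

  # prtconf -pv | grep class-code

and match the value shown against the controller node (it should end in
010601 when the BIOS is set to AHCI mode, as above).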

If Native IDE is selected, the ICH10 SATA interface should
appear as two controllers, the first for ports 0-3,
and the second for ports 4 and 5.

Regards
Nigel Smith
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Painfully slow RAIDZ2 as fibre channel COMSTAR export

2010-02-14 Thread Nigel Smith
Hi Dave
So which hard drives are connected to which controllers?
And what device drivers are those controllers using?

The output from 'format', 'cfgadm' and 'prtconf -D'
may help us to understand.

Strange that you say that there are two hard drives
per controller, but three drives are showing
high %b.

And strange that you have c7,c8,c9,c10,c11
which looks like FIVE controllers!

Regards
Nigel Smith
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ..and now ZFS send dedupe

2009-11-09 Thread Nigel Smith
More ZFS goodness putback before close of play for snv_128.

  http://mail.opensolaris.org/pipermail/onnv-notify/2009-November/010768.html

  http://hg.genunix.org/onnv-gate.hg/rev/216d8396182e

Regards
Nigel Smith
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS + fsck

2009-11-09 Thread Nigel Smith
On Thu Nov 5 14:38:13 PST 2009, Gary Mills wrote:
 It would be nice to see this information at:
 http://hub.opensolaris.org/bin/view/Community+Group+on/126-130
 but it hasn't changed since 23 October.

Well it seems we have an answer:

http://mail.opensolaris.org/pipermail/zfs-discuss/2009-November/033672.html

On Mon Nov 9 14:26:54 PST 2009, James C. McPherson wrote:
 The flag days page has not been updated since the switch
 to XWiki, it's on my todo list but I don't have an ETA
 for when it'll be done.

Perhaps anyone interested in seeing the flags days page
resurrected can petition James to raise the priority on
his todo list.
Thanks
Nigel Smith
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] marvell88sx2 driver build126

2009-11-08 Thread Nigel Smith
I think you can work out the files for the driver by looking here:

http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/pkgdefs/SUNWmv88sx/prototype_i386

So the 32 bit driver is:

 kernel/drv/marvell88sx

And the 64 bit driver is:

 kernel/drv/amd64/marvell88sx

It a pity that the marvell driver is not open source.
For the sata drivers that are open source,

  ahci, nv_sata, si3124

..you can see the history of all the changes to the source code
of the drivers, all cross referenced to the bug numbers, using OpenGrok:

  
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/io/sata/adapters/

Regards
Nigel Smith
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS + fsck

2009-11-05 Thread Nigel Smith
Hi Robert
I think you mean snv_128 not 126 :-)

  6667683  need a way to rollback to an uberblock from a previous txg 
  http://bugs.opensolaris.org/view_bug.do?bug_id=6667683

  http://hg.genunix.org/onnv-gate.hg/rev/8aac17999e4d

Regards
Nigel Smith
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS + fsck

2009-11-05 Thread Nigel Smith
Hi Gary
I will let 'website-discuss' know about this problem.
They normally fix issues like that.
Those pages always seemed to just update automatically.
I guess it's related to the website transition.
Thanks
Nigel Smith
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] dedupe is in

2009-11-02 Thread Ross Smith
Ok, thanks everyone then (but still thanks to Victor for the heads up)  :-)


On Mon, Nov 2, 2009 at 4:03 PM, Victor Latushkin
victor.latush...@sun.com wrote:
 On 02.11.09 18:38, Ross wrote:

 Double WOHOO!  Thanks Victor!

 Thanks should go to Tim Haley, Jeff Bonwick and George Wilson ;-)

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] dedupe is in

2009-11-02 Thread Nigel Smith
ZFS dedup will be in snv_128,
but putbacks to snv_128 will not likely close till the end of this week.

The OpenSolaris dev repository was updated to snv_126 last Thursday:
http://mail.opensolaris.org/pipermail/opensolaris-announce/2009-October/001317.html

So it looks like about 5 weeks before the dev
repository will be updated to snv_128.

Then we see if any bugs emerge as we all rush to test it out...
Regards
Nigel Smith
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] zfs corrupts grub

2009-10-03 Thread Terry Smith
This is opensolaris on a Tecra M5 using a 128GB SSD as the boot device.  This
device is partitioned into two roughly 60GB partitions.

I installed opensolaris 2009.06 into the first partition then did an image
update to build 124 from the dev repository.  All went well, so then I created a
zpool from the second partition, which created fine, and I could add
filesystems to that pool.  However, when I came to reboot the laptop there was a
message (I think from bootadm) about an unrecognised GRUB entry, and the
reboot stopped with the word GRUB appearing at the top left of the screen.  So
it appears that zfs has done something to the grub entry such that I can no
longer boot the laptop.  Does anyone have any ideas how to either recover from
this and/or prevent it happening in the future?

T
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] How to map solaris disk devices to physical location for ZFS pool setup

2009-09-15 Thread David Smith
Hi, I'm setting up a ZFS environment running on a Sun x4440 + J4400 arrays 
(similar to 7410 environment) and I was trying to figure out the best way to 
map a disk drive physical location (tray and slot) to the Solaris device 
c#t#d#.   Do I need to install the CAM software to do this, or is there another 
way?  I would like to understand the Solaris device to physical drive location
mapping so that I can set up my ZFS pool mirrors/raid properly.

I'm currently running Solaris Express build 119.

Thanks,

David
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Read about ZFS backup - Still confused

2009-09-03 Thread Cork Smith
I am just a simple home user. When I was using linux, I backed up my home 
directory (which contained all my critical data) using tar. I backed up my 
linux partition using partimage. These backups were put on dvd's. That way I 
could restore (and have) even if the hard drive completely went belly up.

I would like to duplicate this scheme using zfs commands. I know I can copy a 
snapshot to a dvd but can I recover using just the snapshot or does it rely on 
the zfs file system on my hard drive being ok?

Cork
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Read about ZFS backup - Still confused

2009-09-03 Thread Cork Smith
Let me try rephrasing this. I would like the ability to restore so that my system
mirrors its state at the time when I backed it up, given that the old hard drive is
now a door stop.

Cork
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] snv_110 - snv_121 produces checksum errors on Raid-Z pool

2009-09-02 Thread Nigel Smith
Adam
The 'OpenSolaris Development Release Packaging Repository'
has recently been updated to release 121.

  
http://mail.opensolaris.org/pipermail/opensolaris-announce/2009-August/001253.html
  http://pkg.opensolaris.org/dev/en/index.shtml

Just to be totally clear, are you recommending that anyone
using raidz, raidz2, or raidz3 should not upgrade to that release?

For the people who have already upgraded, presumably the
recommendation is that they should revert to a pre 121 BE.

Thanks
Nigel Smith
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Confusion

2009-08-25 Thread Stephen Nelson-Smith
Hi Volker,


On Fri, Aug 21, 2009 at 5:42 PM, Volker A. Brandt v...@bb-c.de wrote:
  Can you actually see the literal commands?  A bit like MySQL's 'show
  create table'?  Or are you just intrepreting the output?

 Just interpreting the output.

 Actually you could see the commands on the old server by using

  zpool history oradata

That's awesome - thank you very much!

S.
-- 
Stephen Nelson-Smith
Technical Director
Atalanta Systems Ltd
www.atalanta-systems.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Confusion

2009-08-21 Thread Stephen Nelson-Smith
Sorry - didn't realise I'd replied only to you.
 You can either set the mountpoint property when you create the dataset or do 
 it
 in a second operation after the create.

 Either:
 # zfs create -o mountpoint=/u01 rpool/u01

 or:
 # zfs create rpool/u01
 # zfs set mountpoint=/u01 rpool/u01

Got you.

 I'm not sure about the remote mount.  It appears to be a local SMB resource
 mounted as NFS?  I've never seen that before.

Ah that's just a Sharity mount - it's a red herring.  u0[1-4] will be the same.

Thanks very much,

S.
-- 
Stephen Nelson-Smith
Technical Director
Atalanta Systems Ltd
www.atalanta-systems.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Shrinking a zpool?

2009-08-06 Thread Nigel Smith
Hi Matt
Thanks for this update, and the confirmation
to the outside world that this problem is being actively
worked on with significant resources.

But I would like to support Cyril's comment.

AFAIK, any updates you are making to bug 4852783 are not
available to the outside world via the normal bug URL.
It would be useful if we were able to see them.

I think it is frustrating for the outside world that
it cannot see Sun's internal source code repositories
for work in progress, and only see the code when it is
complete and pushed out.

And so there is no way to judge what progress is being made,
or to actively help with code reviews or testing.

Best Regards
Nigel Smith
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Shrinking a zpool?

2009-08-06 Thread Nigel Smith
Bob Friesenhahn wrote:
 Sun has placed themselves in the interesting predicament that being 
 open about progress on certain high-profile enterprise features 
 (such as shrink and de-duplication) could cause them to lose sales to 
 a competitor.  Perhaps this is a reason why Sun is not nearly as open 
 as we would like them to be.

I agree that it is difficult for Sun, at this time, to 
be more 'open', especially for ZFS, as we still await the resolution
of Oracle purchasing Sun, the court case with NetApp over patents,
and now the GreenBytes issue!

But I would say they are more likely to avoid losing sales
by confirming what enhancements they are prioritising.
I think people will wait if they know work is being done,
and progress being made, although not indefinitely.

I guess it depends on the rate of progress of ZFS compared to say btrfs.

I would say that maybe Sun should have held back on
announcing the work on deduplication, as it just seems to 
have ramped up frustration, now that it seems no
more news is forthcoming. It's easy to be wise after the event
and time will tell.

Thanks
Nigel Smith
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Tunable iSCSI timeouts - ZFS over iSCSI fix

2009-07-29 Thread Ross Smith
Yup, somebody pointed that out to me last week and I can't wait :-)


On Wed, Jul 29, 2009 at 7:48 PM, Dave dave-...@dubkat.com wrote:
 Anyone (Ross?) creating ZFS pools over iSCSI connections will want to pay
 attention to snv_121 which fixes the 3 minute hang after iSCSI disk
 problems:

 http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=649

 Yay!

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work

2009-07-27 Thread Nigel Smith
David Magda wrote:
 This is also (theoretically) why a drive purchased from Sun is more
 expensive than a drive purchased from your neighbourhood computer
 shop: Sun (and presumably other manufacturers) takes the time and  
 effort to test things to make sure that when a drive says I've synced  
 the data, it actually has synced the data. This testing is what  
 you're presumably paying for.

So how do you test a hard drive to check it does actually sync the data?
How would you do it in theory?
And in practice?

Now say we are talking about a virtual hard drive,
rather than a physical hard drive.
How would that affect the answer to the above questions?

Thanks
Nigel
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ZFS gzip Death Spiral Revisited

2009-06-13 Thread jeramy smith
I have the following configuation.

My storage:
12 luns from a Clariion 3x80. Each LUN is a whole 6 disk raid-6.

My host:
Sun t5240 with 32 hardware threads and 16gig of ram.

My zpool:
all 12 luns from the clariion in a simple pool

My test data:
A 1 gig backup file of a ufsdump from /opt on a machine with lots of
mixed binary/text data.
A 15gig file that is already tightly compressed.

I wrote some benchmarks and tested. This system is completely idle
except for testing.
With the 1 gig file:
testing record sizes for 8,16,32,128k
testing compression with off,on,gzip
128k record sizes were fastest.
gzip compression was fastest.
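
For anyone reproducing this, the knobs involved are just per-dataset
properties, e.g. (dataset names are made up):

  # zfs create -o recordsize=128k -o compression=gzip tank/test-gzip
  # zfs create -o recordsize=128k -o compression=lzjb tank/test-lzjb
  # zfs get compressratio tank/test-gzip tank/test-lzjb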

Using the best of those results, I then ran the torture test with a
file almost as large as system memory that was already compressed. The
results were the infamous "lock up, stutter, can't kill the cp/dd
command, oh god, the system console is unresponsive too, what has science
done?!?!".

In the past threads I dug up, it seems that people were using wimpier
hardware or gzip-9 and running into this. I ran into it with very
capable hardware.

I do not get this behavior using the default lzjb compression, and I
was able to also produce it using weaker gzip-3 compression.


Is there a fix for this I am not aware of? Workaround? Etc? gzip
compression works wonderfully with the uncompressed, smaller 1-4g
files I am trying. It would be a shame to use the weaker default
compression because of this test case.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Comstar production-ready?

2009-03-03 Thread Stephen Nelson-Smith
Hi,

I recommended a ZFS-based archive solution to a client needing to have
a network-based archive of 15TB of data in a remote datacentre.  I
based this on an X2200 + J4400, Solaris 10 + rsync.

This was enthusiastically received, to the extent that the client is
now requesting that their live system (15TB data on cheap SAN and
Linux LVM) be replaced with a ZFS-based system.

The catch is that they're not ready to move their production systems
off Linux - so web, db and app layer will all still be on RHEL 5.

As I see it, if they want to benefit from ZFS at the storage layer,
the obvious solution would be a NAS system, such as a 7210, or
something built from a JBOD and a head node that does something
similar.  The 7210 is out of budget - and I'm not quite sure how it
presents its storage - is it NFS/CIFS?  If so, presumably it would be
relatively easy to build something equivalent, but without the
(awesome) interface.

The interesting alternative is to set up Comstar on SXCE, create
zpools and volumes, and make these available either over a fibre
infrastructure, or iSCSI.  I'm quite excited by this as a solution,
but I'm not sure if it's really production ready.
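
The moving parts on the SXCE side are roughly as follows (a rough sketch only;
names and sizes are made up, and the stmf/iscsi target services need to be
enabled): carve a zvol, register it with COMSTAR as a logical unit, and
expose it over an iSCSI (or FC) target:

  # zfs create -V 1T tank/lun0
  # stmfadm create-lu /dev/zvol/rdsk/tank/lun0
  # stmfadm add-view <GUID reported by create-lu>
  # itadm create-target

The RHEL hosts would then see a plain block device and keep their own
filesystems on top, so they get ZFS checksumming and snapshots at the volume
level rather than per file.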

What other options are there, and what advice/experience can you share?

Thanks,

S.
-- 
Stephen Nelson-Smith
Technical Director
Atalanta Systems Ltd
www.atalanta-systems.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS: unreliable for professional usage?

2009-02-14 Thread Ross Smith
Hey guys,

I'll let this die in a sec, but I just wanted to say that I've gone
and read the on disk document again this morning, and to be honest
Richard, without the description you just wrote, I really wouldn't
have known that uberblocks are in a 128 entry circular queue that's 4x
redundant.

Please understand that I'm not asking for answers to these notes, this
post is purely to illustrate to you ZFS guys that much as I appreciate
having the ZFS docs available, they are very tough going for anybody
who isn't a ZFS developer.  I consider myself well above average in IT
ability, and I've really spent quite a lot of time in the past year
reading around ZFS, but even so I would definitely have come to the
wrong conclusion regarding uberblocks.

Richard's post I can understand really easily, but in the on disk
format docs, that information is spread over 7 pages of really quite
technical detail, and to be honest, for a user like myself raises as
many questions as it answers:

On page 6 I learn that labels are stored on each vdev, as well as each
disk.  So there will be a label on the pool, mirror (or raid group),
and disk.  I know the disk ones are at the start and end of the disk,
and it sounds like the mirror vdev is in the same place, but where is
the root vdev label?  The example given doesn't mention its location
at all.

Then, on page 7 it sounds like the entire label is overwritten whenever
on-disk data is updated - any time on-disk data is overwritten, there
is potential for error.  To me, it sounds like it's not a 128 entry
queue, but just a group of 4 labels, all of which are overwritten as
data goes to disk.

Then finally, on page 12 the uberblock is mentioned (although as an
aside, the first time I read these docs I had no idea what the
uberblock actually was).  It does say that only one uberblock is
active at a time, but with it being part of the label I'd just assume
these were overwritten as a group..

And that's why I'll often throw ideas out - I can either rely on my
own limited knowledge of ZFS to say if it will work, or I can take
advantage of the excellent community we have here, and post the idea
for all to see.  It's a quick way for good ideas to be improved upon,
and bad ideas consigned to the bin.  I've done it before in my rather
lengthy 'zfs availability' thread.  My thoughts there were thrashed
out nicely, with some quite superb additions (namely the concept of
lop sided mirrors which I think are a great idea).

Ross

PS.  I've also found why I thought you had to search for these blocks,
it was after reading this thread where somebody used mdb to search a
corrupt pool to try to recover data:
http://opensolaris.org/jive/message.jspa?messageID=318009







On Fri, Feb 13, 2009 at 11:09 PM, Richard Elling
richard.ell...@gmail.com wrote:
 Tim wrote:


 On Fri, Feb 13, 2009 at 4:21 PM, Bob Friesenhahn
 bfrie...@simple.dallas.tx.us mailto:bfrie...@simple.dallas.tx.us wrote:

On Fri, 13 Feb 2009, Ross Smith wrote:

However, I've just had another idea.  Since the uberblocks are
pretty
vital in recovering a pool, and I believe it's a fair bit of
work to
search the disk to find them.  Might it be a good idea to
allow ZFS to
store uberblock locations elsewhere for recovery purposes?


Perhaps it is best to leave decisions on these issues to the ZFS
designers who know how things work.

Previous descriptions from people who do know how things work
didn't make it sound very difficult to find the last 20
uberblocks.  It sounded like they were at known points for any
given pool.

Those folks have surely tired of this discussion by now and are
working on actual code rather than reading idle discussion between
several people who don't know the details of how things work.



 People who don't know how things work often aren't tied down by the
 baggage of knowing how things work.  Which leads to creative solutions those
 who are weighed down didn't think of.  I don't think it hurts in the least
 to throw out some ideas.  If they aren't valid, it's not hard to ignore them
 and move on.  It surely isn't a waste of anyone's time to spend 5 minutes
 reading a response and weighing if the idea is valid or not.

 OTOH, anyone who followed this discussion the last few times, has looked
 at the on-disk format documents, or reviewed the source code would know
 that the uberblocks are kept in an 128-entry circular queue which is 4x
 redundant with 2 copies each at the beginning and end of the vdev.
 Other metadata, by default, is 2x redundant and spatially diverse.

 Clearly, the failure mode being hashed out here has resulted in the defeat
 of those protections. The only real question is how fast Jeff can roll out
 the
 feature to allow reverting to previous uberblocks.  The procedure for doing
 this by hand has long been known, and was posted on this forum -- though
 it is tedious.
 -- richard

Re: [zfs-discuss] ZFS: unreliable for professional usage?

2009-02-13 Thread Ross Smith
On Fri, Feb 13, 2009 at 7:41 PM, Bob Friesenhahn
bfrie...@simple.dallas.tx.us wrote:
 On Fri, 13 Feb 2009, Ross wrote:

 Something like that will have people praising ZFS' ability to safeguard
 their data, and the way it recovers even after system crashes or when
 hardware has gone wrong.  You could even have a common causes of this
 are... message, or a link to an online help article if you wanted people to
 be really impressed.

 I see a career in politics for you.  Barring an operating system
 implementation bug, the type of problem you are talking about is due to
 improperly working hardware.  Irreversibly reverting to a previous
 checkpoint may or may not obtain the correct data.  Perhaps it will produce
 a bunch of checksum errors.

Yes, the root cause is improperly working hardware (or an OS bug like
6424510), but with ZFS being a copy on write system, when errors occur
with a recent write, for the vast majority of the pools out there you
still have huge amounts of data that is still perfectly valid and
should be accessible.  Unless I'm misunderstanding something,
reverting to a previous checkpoint gets you back to a state where ZFS
knows it's good (or at least where ZFS can verify whether it's good or
not).

You have to consider that even with improperly working hardware, ZFS
has been checksumming data, so if that hardware has been working for
any length of time, you *know* that the data on it is good.

Yes, if you have databases or files there that were mid-write, they
will almost certainly be corrupted.  But at least your filesystem is
back, and it's in as good a state as it's going to be given that in
order for your pool to be in this position, your hardware went wrong
mid-write.

And as an added bonus, if you're using ZFS snapshots, now your pool is
accessible, you have a bunch of backups available so you can probably
roll corrupted files back to working versions.

For me, that is about as good as you can get in terms of handling a
sudden hardware failure.  Everything that is known to be saved to disk
is there, you can verify (with absolute certainty) whether data is ok
or not, and you have backup copies of damaged files.  In the old days
you'd need to be reverting to tape backups for both of these, with
potentially hours of downtime before you even know where you are.
Achieving that in a few seconds (or minutes) is a massive step
forwards.

 There are already people praising ZFS' ability to safeguard their data, and
 the way it recovers even after system crashes or when hardware has gone
 wrong.

Yes there are, but the majority of these are praising the ability of
ZFS checksums to detect bad data, and to repair it when you have
redundancy in your pool.  I've not seen that many cases of people
praising ZFS' recovery ability - uberblock problems seem to have a
nasty habit of leaving you with tons of good, checksummed data on a
pool that you can't get to, and while many hardware problems are dealt
with, others can hang your entire pool.



 Bob
 ==
 Bob Friesenhahn
 bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
 GraphicsMagick Maintainer,http://www.GraphicsMagick.org/


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS: unreliable for professional usage?

2009-02-13 Thread Ross Smith
On Fri, Feb 13, 2009 at 8:24 PM, Bob Friesenhahn
bfrie...@simple.dallas.tx.us wrote:
 On Fri, 13 Feb 2009, Ross Smith wrote:

 You have to consider that even with improperly working hardware, ZFS
 has been checksumming data, so if that hardware has been working for
 any length of time, you *know* that the data on it is good.

 You only know this if the data has previously been read.

 Assume that the device temporarily stops physically writing, but otherwise
 responds normally to ZFS.  Then the device starts writing again (including a
 recent uberblock), but with a large gap in the writes.  Then the system
 loses power, or crashes.  What happens then?

Well in that case you're screwed, but if ZFS is known to handle even
corrupted pools automatically, when that happens the immediate
response on the forums is going to be something really bad has
happened to your hardware, followed by troubleshooting to find out
what.  Instead of the response now, where we all know there's every
chance the data is ok, and just can't be gotten to without zdb.

Also, that's a pretty extreme situation since you'd need a device that
is being written to but not read from to fail in this exact way.  It
also needs to have no scrubbing being run, so the problem has remained
undetected.

However, even in that situation, if we assume that it happened and
that these recovery tools are available, ZFS will either report that
your pool is seriously corrupted, indicating a major hardware problem
(and ZFS can now state this with some confidence), or ZFS will be able
to open a previous uberblock, mount your pool and begin a scrub, at
which point all your missing writes will be found too and reported.

And then you can go back to your snapshots.  :-D



 Bob
 ==
 Bob Friesenhahn
 bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
 GraphicsMagick Maintainer,http://www.GraphicsMagick.org/


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS: unreliable for professional usage?

2009-02-13 Thread Ross Smith
On Fri, Feb 13, 2009 at 8:24 PM, Bob Friesenhahn
bfrie...@simple.dallas.tx.us wrote:
 On Fri, 13 Feb 2009, Ross Smith wrote:

 You have to consider that even with improperly working hardware, ZFS
 has been checksumming data, so if that hardware has been working for
 any length of time, you *know* that the data on it is good.

 You only know this if the data has previously been read.

 Assume that the device temporarily stops physically writing, but otherwise
 responds normally to ZFS.  Then the device starts writing again (including a
 recent uberblock), but with a large gap in the writes.  Then the system
 loses power, or crashes.  What happens then?

Hey Bob,

Thinking about this a bit more, you've given me an idea:  Would it be
worth ZFS occasionally reading previous uberblocks from the pool, just
to check they are there and working ok?

I wonder if you could do this after a few uberblocks have been
written.  It would seem to be a good way of catching devices that
aren't writing correctly early on, as well as a way of guaranteeing
that previous uberblocks are available to roll back to should a write
go wrong.

I wonder what the upper limits for this kind of write failure is going
to be.  I've seen 30 second delays mentioned in this thread.  How
often are uberblocks written?  Is there any guarantee that we'll
always have more than 30 seconds worth of uberblocks on a drive?
Should ZFS be set so that it keeps either a given number of
uberblocks, or 5 minutes worth of uberblocks, whichever is the larger?

Ross
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS: unreliable for professional usage?

2009-02-13 Thread Ross Smith
You don't, but that's why I was wondering about time limits.  You have
to have a cut off somewhere, but if you're checking the last few
minutes of uberblocks that really should cope with a lot.  It seems
like a simple enough thing to implement, and if a pool still gets
corrupted with these checks in place, you can absolutely, positively
blame it on the hardware.  :D

However, I've just had another idea.  Since the uberblocks are pretty
vital in recovering a pool, and I believe it's a fair bit of work to
search the disk to find them.  Might it be a good idea to allow ZFS to
store uberblock locations elsewhere for recovery purposes?

This could be as simple as a USB stick plugged into the server, a
separate drive, or a network server.  I guess even the ZIL device
would work if it's separate hardware.  But knowing the locations of
the uberblocks would save yet more time should recovery be needed.



On Fri, Feb 13, 2009 at 8:59 PM, Bob Friesenhahn
bfrie...@simple.dallas.tx.us wrote:
 On Fri, 13 Feb 2009, Ross Smith wrote:

 Thinking about this a bit more, you've given me an idea:  Would it be
 worth ZFS occasionally reading previous uberblocks from the pool, just
 to check they are there and working ok?

 That sounds like a good idea.  However, how do you know for sure that the
 data returned is not returned from a volatile cache?  If the hardware is
 ignoring cache flush requests, then any data returned may be from a volatile
 cache.

 Bob
 ==
 Bob Friesenhahn
 bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
 GraphicsMagick Maintainer,http://www.GraphicsMagick.org/


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] FW: Supermicro AOC-SAT2-MV8 hang when drive removed

2009-02-12 Thread Ross Smith
Heh, yeah, I've thought the same kind of thing in the past.  The
problem is that the argument doesn't really work for system admins.

As far as I'm concerned, the 7000 series is a new hardware platform,
with relatively untested drivers, running a software solution that I
know is prone to locking up when hardware faults are handled badly by
drivers.  Fair enough, that actual solution is out of our price range,
but I would still be very dubious about purchasing it.  At the very
least I'd be waiting a year for other people to work the kinks out of
the drivers.

Which is a shame, because ZFS has so many other great features it's
easily our first choice for a storage platform.  The one and only
concern we have is its reliability.  We have snv_106 running as a test
platform now.  If I felt I could trust ZFS 100% I'd roll it out
tomorrow.



On Thu, Feb 12, 2009 at 4:25 PM, Tim t...@tcsac.net wrote:


 On Thu, Feb 12, 2009 at 9:25 AM, Ross myxi...@googlemail.com wrote:

 This sounds like exactly the kind of problem I've been shouting about for
 6 months or more.  I posted a huge thread on availability on these forums
 because I had concerns over exactly this kind of hanging.

 ZFS doesn't trust hardware or drivers when it comes to your data -
 everything is checksummed.  However, when it comes to seeing whether devices
 are responding, and checking for faults, it blindly trusts whatever the
 hardware or driver tells it.  Unfortunately, that means ZFS is vulnerable to
 any unexpected bug or error in the storage chain.  I've encountered at least
 two hang conditions myself (and I'm not exactly a heavy user), and I've seen
 several others on the forums, including a few on x4500's.

 Now, I do accept that errors like this will be few and far between, but
 they still mean you have the risk that a badly handled error condition can
 hang your entire server, instead of just one drive.  Solaris can handle
 things like CPUs or memory going faulty for crying out loud.  Its RAID
 storage system had better be able to handle a disk failing.

 Sun seem to be taking the approach that these errors should be dealt with
 in the driver layer.  And while that's technically correct, a reliable
 storage system had damn well better be able to keep the server limping along
 while we wait for patches to the storage drivers.

 ZFS absolutely needs an error handling layer between the volume manager
 and the devices.  It needs to timeout items that are not responding, and it
 needs to drop bad devices if they could cause problems elsewhere.

 And yes, I'm repeating myself, but I can't understand why this is not
 being acted on.  Right now the error checking appears to be such that if an
 unexpected, or badly handled error condition occurs in the driver stack, the
 pool or server hangs.  Whereas the expected behavior would be for just one
 drive to fail.  The absolute worst case scenario should be that an entire
 controller has to be taken offline (and I would hope that the controllers in
 an x4500 would be running separate instances of the driver software).

 Not one of those conditions should be fatal; good storage designs cope
 with them all, and good error handling at the ZFS layer is absolutely vital
 when you have projects like Comstar introducing more and more types of
 storage device for ZFS to work with.

 Each extra type of storage introduces yet more software into the equation,
 and increases the risk of finding faults like this.  While they will be
 rare, they should be expected, and ZFS should be designed to handle them.


 I'd imagine for the exact same reason short-stroking/right-sizing isn't a
 concern.

 We don't have this problem in the 7000 series, perhaps you should buy one
 of those.

 ;)

 --Tim

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS: unreliable for professional usage?

2009-02-12 Thread Ross Smith
That would be the ideal, but really I'd settle for just improved error
handling and recovery for now.  In the longer term, disabling write
caching by default for USB or Firewire drives might be nice.


On Thu, Feb 12, 2009 at 8:35 PM, Gary Mills mi...@cc.umanitoba.ca wrote:
 On Thu, Feb 12, 2009 at 11:53:40AM -0500, Greg Palmer wrote:
 Ross wrote:
 I can also state with confidence that very, very few of the 100 staff
 working here will even be aware that it's possible to unmount a USB volume
 in windows.  They will all just pull the plug when their work is saved,
 and since they all come to me when they have problems, I think I can
 safely say that pulling USB devices really doesn't tend to corrupt
 filesystems in Windows.  Everybody I know just waits for the light on the
 device to go out.
 
 The key here is that Windows does not cache writes to the USB drive
 unless you go in and specifically enable them. It caches reads but not
 writes. If you enable them you will lose data if you pull the stick out
 before all the data is written. This is the type of safety measure that
 needs to be implemented in ZFS if it is to support the average user
 instead of just the IT professionals.

 That implies that ZFS will have to detect removable devices and treat
 them differently than fixed devices.  It might have to be an option
 that can be enabled for higher performance with reduced data security.

 --
 -Gary Mills--Unix Support--U of M Academic Computing and Networking-

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Data loss bug - sidelined??

2009-02-06 Thread Ross Smith
I can check on Monday, but the system will probably panic... which
doesn't really help :-)

Am I right in thinking failmode=wait is still the default?  If so,
that should be how it's set as this testing was done on a clean
install of snv_106.  From what I've seen, I don't think this is a
problem with the zfs failmode.  It's more of an issue of what happens
in the period *before* zfs realises there's a problem and applies the
failmode.
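
For reference, the failmode property can be checked and changed per pool --
a quick sketch, assuming a pool named tank:

# zpool get failmode tank
# zpool set failmode=panic tank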

This time there was just a window of a couple of minutes while
commands would continue.  In the past I've managed to stretch it out
to hours.

To me the biggest problems are:
- ZFS accepting writes that don't happen (from both before and after
the drive is removed)
- No logging or warning of this in zpool status

I appreciate that if you're using cache, some data loss is pretty much
inevitable when a pool fails, but that should be a few seconds worth
of data at worst, not minutes or hours worth.

Also, if a pool fails completely and there's data in the cache that
hasn't been committed to disk, it would be great if Solaris could
respond by:

- immediately dumping the cache to any (all?) working storage
- prompting the user to fix the pool, or save the cache before
powering down the system

Ross


On Fri, Feb 6, 2009 at 5:49 PM, Richard Elling richard.ell...@gmail.com wrote:
 Ross, this is a pretty good description of what I would expect when
 failmode=continue. What happens when failmode=panic?
 -- richard


 Ross wrote:

 Ok, it's still happening in snv_106:

 I plugged a USB drive into a freshly installed system, and created a
 single disk zpool on it:
 # zpool create usbtest c1t0d0

 I opened the (nautilus?) file manager in gnome, and copied the /etc/X11
 folder to it.  I then copied the /etc/apache folder to it, and at 4:05pm,
 disconnected the drive.

 At this point there are *no* warnings on screen, or any indication that
 there is a problem.  To check that the pool was still working, I created
 duplicates of the two folders on that drive.  That worked without any
 errors, although the drive was physically removed.

 4:07pm
 I ran zpool status, the pool is actually showing as unavailable, so at
 least that has happened faster than my last test.

 The folder is still open in gnome, however any attempt to copy files to or
 from it just hangs the file transfer operation window.

 4:09pm
 /usbtest is still visible in gnome
 Also, I can still open a console and use the folder:

 # cd usbtest
 # ls
 X11X11 (copy) apache apache (copy)

 I also tried:
 # mv X11 X11-test

 That hung, but I saw the X11 folder disappear from the graphical file
 manager, so the system still believes something is working with this pool.

 The main GUI is actually a little messed up now.  The gnome file manager
 window looking at the /usbtest folder has hung.  Also, right-clicking the
 desktop to open a new terminal hangs, leaving the right-click menu on
 screen.

 The main menu still works though, and I can still open a new terminal.

 4:19pm
 Commands such as ls are finally hanging on the pool.

 At this point I tried to reboot, but it appears that isn't working.  I
 used system monitor to kill everything I had running and tried again, but
 that didn't help.

 I had to physically power off the system to reboot.

 After the reboot, as expected, /usbtest still exists (even though the
 drive is disconnected).  I removed that folder and connected the drive.

 ZFS detects the insertion and automounts the drive, but I find that
 although the pool is showing as online, and the filesystem shows as mounted
 at /usbtest.  But the /usbtest directory doesn't exist.

 I had to export and import the pool to get it available, but as expected,
 I've lost data:
 # cd usbtest
 # ls
 X11

 even worse, zfs is completely unaware of this:
 # zpool status -v usbtest
  pool: usbtest
  state: ONLINE
  scrub: none requested
 config:

NAMESTATE READ WRITE CKSUM
usbtest ONLINE   0 0 0
  c1t0d0ONLINE   0 0 0

 errors: No known data errors


 So in summary, there are a good few problems here, many of which I've
 already reported as bugs:

 1. ZFS still accepts read and write operations for a faulted pool, causing
 data loss that isn't necessarily reported by zpool status.
 2. Even after writes start to hang, it's still possible to continue
 reading data from a faulted pool.
 3. A faulted pool causes unwanted side effects in the GUI, making the
 system hard to use, and impossible to reboot.
 4. After a hard reset, ZFS does not recover cleanly.  Unused mountpoints
 are left behind.
 5. Automatic mounting of pools doesn't seem to work reliably.
 6. zfs status doesn't inform of any problems mounting the pool.



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Data loss bug - sidelined??

2009-02-06 Thread Ross Smith
Something to do with cache was my first thought.  It seems to be able
to read and write from the cache quite happily for some time,
regardless of whether the pool is live.

If you're reading or writing large amounts of data, zfs starts
experiencing IO faults and offlines the pool pretty quickly.  If
you're just working with small datasets, or viewing files that you've
recently opened, it seems you can stretch it out for quite a while.

But yes, it seems that it doesn't enter failmode until the cache is
full.  I would expect it to hit this within 5 seconds (since I believe
that is how often the cache should be writing).
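
For what it's worth, the interval in question is the transaction group sync.
A sketch of how to peek at the related tunable -- the name zfs_txg_timeout
is an assumption about this particular build, so verify it exists before
relying on it:

# echo "zfs_txg_timeout/D" | mdb -k
(reads the current value, in seconds, from the running kernel; it can also
be set persistently in /etc/system with: set zfs:zfs_txg_timeout = 5)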


On Fri, Feb 6, 2009 at 7:04 PM, Brent Jones br...@servuhome.net wrote:
 On Fri, Feb 6, 2009 at 10:50 AM, Ross Smith myxi...@googlemail.com wrote:
 I can check on Monday, but the system will probably panic... which
 doesn't really help :-)

 Am I right in thinking failmode=wait is still the default?  If so,
 that should be how it's set as this testing was done on a clean
 install of snv_106.  From what I've seen, I don't think this is a
 problem with the zfs failmode.  It's more of an issue of what happens
 in the period *before* zfs realises there's a problem and applies the
 failmode.

 This time there was just a window of a couple of minutes while
 commands would continue.  In the past I've managed to stretch it out
 to hours.

 To me the biggest problems are:
 - ZFS accepting writes that don't happen (from both before and after
 the drive is removed)
 - No logging or warning of this in zpool status

 I appreciate that if you're using cache, some data loss is pretty much
 inevitable when a pool fails, but that should be a few seconds worth
 of data at worst, not minutes or hours worth.

 Also, if a pool fails completely and there's data in the cache that
 hasn't been committed to disk, it would be great if Solaris could
 respond by:

 - immediately dumping the cache to any (all?) working storage
 - prompting the user to fix the pool, or save the cache before
 powering down the system

 Ross


 On Fri, Feb 6, 2009 at 5:49 PM, Richard Elling richard.ell...@gmail.com 
 wrote:
 Ross, this is a pretty good description of what I would expect when
 failmode=continue. What happens when failmode=panic?
 -- richard


 Ross wrote:

 Ok, it's still happening in snv_106:

 I plugged a USB drive into a freshly installed system, and created a
 single disk zpool on it:
 # zpool create usbtest c1t0d0

 I opened the (nautilus?) file manager in gnome, and copied the /etc/X11
 folder to it.  I then copied the /etc/apache folder to it, and at 4:05pm,
 disconnected the drive.

 At this point there are *no* warnings on screen, or any indication that
 there is a problem.  To check that the pool was still working, I created
 duplicates of the two folders on that drive.  That worked without any
 errors, although the drive was physically removed.

 4:07pm
 I ran zpool status, the pool is actually showing as unavailable, so at
 least that has happened faster than my last test.

 The folder is still open in gnome, however any attempt to copy files to or
 from it just hangs the file transfer operation window.

 4:09pm
 /usbtest is still visible in gnome
 Also, I can still open a console and use the folder:

 # cd usbtest
 # ls
 X11X11 (copy) apache apache (copy)

 I also tried:
 # mv X11 X11-test

 That hung, but I saw the X11 folder disappear from the graphical file
 manager, so the system still believes something is working with this pool.

 The main GUI is actually a little messed up now.  The gnome file manager
 window looking at the /usbtest folder has hung.  Also, right-clicking the
 desktop to open a new terminal hangs, leaving the right-click menu on
 screen.

 The main menu still works though, and I can still open a new terminal.

 4:19pm
 Commands such as ls are finally hanging on the pool.

 At this point I tried to reboot, but it appears that isn't working.  I
 used system monitor to kill everything I had running and tried again, but
 that didn't help.

 I had to physically power off the system to reboot.

 After the reboot, as expected, /usbtest still exists (even though the
 drive is disconnected).  I removed that folder and connected the drive.

 ZFS detects the insertion and automounts the drive, but I find that
 although the pool is showing as online and the filesystem shows as mounted
 at /usbtest, the /usbtest directory doesn't exist.

 I had to export and import the pool to get it available, but as expected,
 I've lost data:
 # cd usbtest
 # ls
 X11

 even worse, zfs is completely unaware of this:
 # zpool status -v usbtest
  pool: usbtest
  state: ONLINE
  scrub: none requested
 config:

NAMESTATE READ WRITE CKSUM
usbtest ONLINE   0 0 0
  c1t0d0ONLINE   0 0 0

 errors: No known data errors


 So in summary, there are a good few problems here, many of which I've
 already reported as bugs:

 1. ZFS

Re: [zfs-discuss] Any way to set casesensitivity=mixed on the main pool?

2009-02-04 Thread Ross Smith
It's not intuitive because when you know that -o sets options, an
error message saying that it's not a valid property makes you think
that it's not possible to do what you're trying.

Documented and intuitive are very different things.  I do appreciate
that the details are there in the manuals, but for items like this
where it's very easy to pick the wrong one, it helps if the commands
can work with you.

The difference between -o and -O is pretty subtle, I just think that
extra sentence in the error message could save a lot of frustration
when people get mixed up.
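
For reference, a minimal sketch of the distinction (hypothetical pool and
disk names):

# zpool create -o autoreplace=on tank c0t0d0
(-o sets a pool property)
# zpool create -O casesensitivity=mixed tank c0t0d0
(-O sets a file system property on the pool's root dataset)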

Ross



On Wed, Feb 4, 2009 at 11:14 AM, Darren J Moffat
darr...@opensolaris.org wrote:
 Ross wrote:

 Good god.  Talk about non intuitive.  Thanks Darren!

 Why isn't that intuitive ?  It is even documented in the man page.

 zpool create [-fn] [-o property=value] ... [-O file-system-
 property=value] ... [-m mountpoint] [-R root] pool vdev ...


 Is it possible for me to suggest a quick change to the zpool error message
 in solaris?  Should I file that as an RFE?  I'm just wondering if the error
 message could be changed to something like:
 property 'casesensitivity' is not a valid pool property.  Did you mean to
 use -O?

 It's just a simple change, but it makes it obvious that it can be done,
 instead of giving the impression that it's not possible.

 Feel free to log the RFE in defect.opensolaris.org.

 --
 Darren J Moffat

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SSD drives in Sun Fire X4540 or X4500 for dedicated ZIL device

2009-01-23 Thread Ross Smith
That's my understanding too.  One (STEC?) drive as a write cache,
basically a write-optimised SSD, and cheaper, larger, read-optimised
SSDs for the read cache.

I thought it was an odd strategy until I read into SSDs a little more
and realised you really do have to think about your usage cases with
these.  SSDs are very definitely not all alike.
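
For anyone wanting to try the same split themselves, log and cache devices
are added per pool -- a sketch with hypothetical device names:

# zpool add tank log c2t0d0
(write-optimised SSD as a dedicated ZIL / slog device)
# zpool add tank cache c3t0d0
(read-optimised SSD as an L2ARC read cache)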


On Fri, Jan 23, 2009 at 4:33 PM, Greg Mason gma...@msu.edu wrote:
 If I'm not mistaken (and somebody please correct me if I'm wrong), the Sun
 7000 series storage appliances (the Fishworks boxes) use enterprise SSDs,
 with dram caching. One such product is made by STEC.

 My understanding is that the Sun appliances use one SSD for the ZIL, and one
 as a read cache. For the 7210 (which is basically a Sun Fire X4540), that
 gives you 46 disks and 2 SSDs.

 -Greg


 Bob Friesenhahn wrote:

 On Thu, 22 Jan 2009, Ross wrote:

 However, now I've written that, Sun use SATA (SAS?) SSD's in their high
 end fishworks storage, so I guess it definately works for some use cases.

 But the fishworks (Fishworks is a development team, not a product) write
 cache device is not based on FLASH.  It is based on DRAM.  The difference is
 like night and day. Apparently there can also be a read cache which is based
 on FLASH.

 Bob
 ==
 Bob Friesenhahn
 bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
 GraphicsMagick Maintainer,http://www.GraphicsMagick.org/

 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Verbose Information from zfs send -v snapshot

2009-01-16 Thread Nick Smith
What 'verbose information' does zfs send -v snapshot report?

Also on Solaris 10u6 I don't get any output at all - is this a bug?

Regards,

Nick
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs list improvements?

2009-01-10 Thread Ross Smith
Hmm... that's a tough one.  To me, it's a trade-off either way: using
a -r parameter to specify the depth for zfs list feels more intuitive
than adding extra options to modify the -r behaviour, but I can see
your point.

But then, using -c or -d means there's an optional parameter for zfs
list that you don't have in the other commands anyway.  And would you
have to use -c or -d with -r, or would they work on their own,
providing two ways to achieve very similar functionality?

Also, now you've mentioned that you want to keep things consistent
among all the commands, keeping -c and -d free becomes more important
to me.  You don't know if you might want to use these for another
command later on.

It sounds to me that whichever way you implement it there's going to
be some potential for confusion, but personally I'd stick with using
-r.  It leaves you with a single syntax for viewing children.  The -r
on the other commands can be modified to give an error message if they
don't support this extra parameter, and it leaves both -c and -d free
to use later on.
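
In the meantime a depth limit is easy enough to fake in a script -- a rough
sketch, assuming a pool named tank, that just counts the '/' separators in
the dataset names:

# zfs list -H -o name -r tank | awk -F/ 'NF <= 3'
(shows tank plus two levels of children; adjust the 3 to taste)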

Ross



On Fri, Jan 9, 2009 at 7:16 PM, Richard Morris - Sun Microsystems -
Burlington United States richard.mor...@sun.com wrote:
 On 01/09/09 01:44, Ross wrote:

 Can I ask why we need to use -c or -d at all?  We already have -r to
 recursively list children, can't we add an optional depth parameter to that?

 You then have:
 zfs list : shows current level (essentially -r 0)
 zfs list -r : shows all levels (infinite recursion)
 zfs list -r 2 : shows 2 levels of children

 An optional depth argument to -r has already been suggested:
 http://mail.opensolaris.org/pipermail/zfs-discuss/2009-January/054241.html

 However, other zfs subcommands such as destroy, get, rename, and snapshot
 also provide -r options without optional depth arguments.  And it's probably
 good to keep the zfs subcommand option syntax consistent.  On the other hand,
 if all of the zfs subcommands were modified to accept an optional depth
 argument to -r, then this would not be an issue.  But, for example, the top
 level(s) of datasets cannot be destroyed if that would leave orphaned
 datasets.

 BTW, when no dataset is specified, zfs list is the same as zfs list -r
 (infinite recursion).  When a dataset is specified then it shows only the
 current level.

 Does anyone have any non-theoretical situations where a depth option other
 than 1 or 2 would be used?  Are scripts being used to work around this
 problem?

 -- Rich









___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] zfs destroy is taking a long time...

2009-01-08 Thread David Smith
I was wondering if anyone has any experience with how long a zfs destroy of 
about 40 TB should take?  So far, it has been about an hour...  Is there any 
good way to tell if it is working or if it is hung?

Doing a zfs list just hangs.  If you do a more specific zfs list, then it is 
okay... zfs list pool/another-fs
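
One rough way to tell the difference -- not a guarantee, just a sketch,
assuming the pool is named pool as above -- is to watch whether the pool is
still doing I/O while the destroy runs:

# zpool iostat pool 5
(steady read/write activity suggests it is still walking and freeing blocks;
a pool that stays completely idle for a long stretch is more worrying)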

Thanks,

David
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs destroy is taking a long time...

2009-01-08 Thread David Smith
A few more details:

The system is a Sun x4600 running Solaris 10 Update 4.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs destroy is taking a long time...

2009-01-08 Thread David W. Smith

On Thu, 2009-01-08 at 13:26 -0500, Brian H. Nelson wrote:
 David Smith wrote:
  I was wondering if anyone has any experience with how long a zfs destroy 
  of about 40 TB should take?  So far, it has been about an hour...  Is there 
  any good way to tell if it is working or if it is hung?
 
  Doing a zfs list just hangs.  If you do a more specific zfs list, then it 
  is okay... zfs list pool/another-fs
 
  Thanks,
 
  David

 
 I can't speak to something like 40 TB, but I can share a related story 
 (on Solaris 10u5).
 
 A couple days ago, I tried to zfs destroy a clone of a snapshot of a 191 
 GB zvol. It didn't complete right away, but the machine appeared to 
 continue working on it, so I decided to let it go overnight (it was near 
 the end of the day). Well, by about 4:00 am the next day, the machine 
 had completely run out of memory and hung. When I came in, I forced a 
 sync from prom to get it back up. While it was booting, it stopped 
 during (I think) the zfs initialization part, where it ran the disks for 
 about 10 minutes before continuing. When the machine was back up, 
 everything appeared to be ok. The clone was still there, although usage 
 had changed to zero.
 
 I ended up patching the machine up to the latest u6 kernel + zfs patch 
 (13-01 + 139579-01). After that, the zfs destroy went off without a 
 hitch.
 
 I turned up bug 6606810 'zfs destroy volume is taking hours to 
 complete' which is supposed to be fixed by 139579-01. I don't know if 
 that was the cause of my issue or not. I've got a 2GB kernel dump if 
 anyone is interested in looking.
 
 -Brian
 

Brian,

Thanks for the reply.  I'll take a look at the 139579-01 patch.  Perhaps
a Sun engineer will also comment on whether this issue is fixed by these
patches.

David


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Using zfs mirror as a simple backup mechanism for time-slider.

2008-12-22 Thread Ross Smith
On Fri, Dec 19, 2008 at 6:47 PM, Richard Elling richard.ell...@sun.com wrote:
 Ross wrote:

 Well, I really like the idea of an automatic service to manage
 send/receives to backup devices, so if you guys don't mind, I'm going to
 share some other ideas for features I think would be useful.


 cool.

 One of the first is that you need some kind of capacity management and
 snapshot deletion.  Eventually backup media are going to fill and you need
 to either prompt the user to remove snapshots, or even better, you need to
 manage the media automatically and remove old snapshots to make space for
 new ones.


 I've implemented something like this for a project I'm working on.
 Consider this a research project at this time, though I hope to
 leverage some of the things we learn as we scale up, out, and
 refine the operating procedures.

Way cool :D

 There is a failure mode lurking here.  Suppose you take two sets
 of snapshots: local and remote.  You want to do an incremental
 send, for efficiency.  So you look at the set of snapshots on both
 machines and find the latest, common snapshot.  You will then
 send the list of incrementals from the latest, common through the
 latest snapshot.  On the remote machine, if there are any other
 snapshots not in the list you are sending and newer than the latest,
 common snapshot, then the send/recv will fail.  In practice, this
 means you can run into that failure simply by using the zfs-auto-snapshot
 feature, which will automatically destroy older snapshots as it goes
 (eg. the default policy for frequent is to take snapshots every 15
 minutes and keep 4).

 If you never have an interruption in your snapshot schedule, you
 can merrily cruise along and not worry about this.  But if there is
 an interruption (for maintenance, perhaps) and a snapshot is
 destroyed on the sender, then you also must make sure it gets
 destroyed on the receiver.  I just polished that code yesterday,
 and it seems to work fine... though it makes folks a little nervous.
 Anyone with an operations orientation will recognize that there
 needs to be a good process wrapped around this, but I haven't
 worked through all of the scenarios on the receiver yet.

Very true.  In this context I think this would be fine.  You would
want a warning to pop up saying that a snapshot has been deleted
locally and will have to be overwritten on the backup, but I think
that would be ok.  If necessary you could have a help page explaining
why - essentially this is a copy of your pool, not just a backup of
your files, and to work it needs an accurate copy of your snapshots.
If you wanted to be really fancy, you could have an option for the
user to view the affected files, but I think that's probably over
complicating things.

I don't suppose there's any way the remote snapshot can be cloned /
separated from the pool just in case somebody wanted to retain access
to the files within it?
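
Going back to the incremental pattern Richard describes above, a minimal
sketch with hypothetical dataset, host and pool names (and, per his point,
any snapshots on the receiver newer than the common one still have to be
destroyed first):

# zfs send -I tank/data@common tank/data@latest | ssh backuphost zfs recv -F backuppool/data
(-I sends every snapshot between @common and @latest; -F rolls the receiving
file system back to its most recent snapshot before applying them)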


 I'm thinking that a setup like time slider would work well, where you
 specify how many of each age of snapshot to keep.  But I would want to be
 able to specify different intervals for different devices.

 eg. I might want just the latest one or two snapshots on a USB disk so I
 can take my files around with me.  On a removable drive however I'd be more
 interested in preserving a lot of daily / weekly backups.  I might even have
 an archive drive that I just store monthly snapshots on.

 What would be really good would be a GUI that can estimate how much space
 is going to be taken up for any configuration.  You could use the existing
 snapshots on disk as a guide, and take an average size for each interval,
 giving you average sizes for hourly, daily, weekly, monthly, etc...


 ha ha, I almost blew coffee out my nose ;-)  I'm sure that once
 the forward time-slider functionality is implemented, it will be
 much easier to manage your storage utilization :-)  So, why am
 I giggling?  My wife just remembered that she hadn't taken her
 photos off the camera lately... 8 GByte SD cards are the vehicle
 of evil destined to wreck your capacity planning :-)

Haha, that's a great image, but I've got some food for thought even with this.

If you think about it, even though 8GB sounds like a lot, it's barely over
1% of a 500GB drive, so it's not an unmanageable blip as far as
storage goes.

Also, if you're using the default settings for Tim's backups, you'll
be taking snapshots every 15 minutes, hour, day, week and month.  Now,
when you start you're not going to have any sensible averages for your
monthly snapshot sizes, but you're very rapidly going to get a set of
figures for your 15 minute snapshots.

What I would suggest is to use those to extrapolate forwards to give
very rough estimates of usage early on, with warnings as to how rough
these are.  In time these estimates will improve in accuracy, and your
8GB photo 'blip' should be relatively easily incorporated.
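
If anyone wants to play with those averages today, the raw numbers are easy
to pull -- a sketch, assuming a pool named tank:

# zfs list -H -t snapshot -o name,used -s creation -r tank
(one line per snapshot, oldest first, with the space uniquely held by each)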

What you could maybe do is have a high and low usage estimate shown in
the GUI.  Early on these will be quite a 

Re: [zfs-discuss] Using zfs mirror as a simple backup mechanism for time-slider.

2008-12-18 Thread Ross Smith
 Absolutely.

 The tool shouldn't need to know that the backup disk is accessed via
 USB, or whatever.  The GUI should, however, present devices
 intelligently, not as cXtYdZ!

Yup, and that's easily achieved by simply prompting for a user
friendly name as devices are attached.  Now you could store that
locally, but it would be relatively easy to drop an XML configuration
file on the device too, allowing the same friendly name to be shown
wherever it's connected.

And this is sounding more and more like something I was thinking of
developing myself.  A proper Sun version would be much better though
(not least because I've never developed anything for Solaris!).
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Using zfs mirror as a simple backup mechanism for time-slider.

2008-12-18 Thread Ross Smith
On Thu, Dec 18, 2008 at 7:11 PM, Nicolas Williams
nicolas.willi...@sun.com wrote:
 On Thu, Dec 18, 2008 at 07:05:44PM +, Ross Smith wrote:
  Absolutely.
 
  The tool shouldn't need to know that the backup disk is accessed via
  USB, or whatever.  The GUI should, however, present devices
  intelligently, not as cXtYdZ!

 Yup, and that's easily achieved by simply prompting for a user
 friendly name as devices are attached.  Now you could store that
 locally, but it would be relatively easy to drop an XML configuration
 file on the device too, allowing the same friendly name to be shown
 wherever it's connected.

 I was thinking more something like:

  - find all disk devices and slices that have ZFS pools on them
  - show users the devices and pool names (and UUIDs and device paths in
   case of conflicts)..

I was thinking that device and pool names are too variable; you need to
be reading serial numbers or IDs from the device and link to that.

  - let the user pick one.

  - in the case that the user wants to initialize a drive to be a backup
   you need something more complex.

- one possibility is to tell the user when to attach the desired
  backup device, in which case the GUI can detect the addition and
  then it knows that that's the device to use (but be careful to
  check that the user also owns the device so that you don't pick
  the wrong one on multi-seat systems)

- another is to be much smarter about mapping topology to physical
  slots and present a picture to the user that makes sense to the
  user, so the user can click on the device they want.  This is much
  harder.

I was actually thinking of a resident service.  Tim's autobackup
script was capable of firing off backups when it detected the
insertion of a USB drive, and if you've got something sitting there
monitoring drive insertions you could have it prompt the user when new
drives are detected, asking if they should be used for backups.

Of course, you'll need some settings for this so it's not annoying if
people don't want to use it.  A simple tick box on that pop up dialog
allowing people to say don't ask me again would probably do.

You'd then need a second way to assign drives if the user changed
their mind.  I'm thinking this would be to load the software and
select a drive.  Mapping to physical slots would be tricky, I think
you'd be better with a simple view that simply names the type of
interface, the drive size, and shows any current disk labels.  It
would be relatively easy then to recognise the 80GB USB drive you've
just connected.

Also, because you're formatting these drives as ZFS, you're not
restricted to just storing your backups on them.  You can create a
root pool (to contain the XML files, etc), and the backups can then be
saved to a filesystem within that.

That means the drive then functions as both a removable drive, and as
a full backup for your system.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Using zfs mirror as a simple backup mechanism for time-slider.

2008-12-18 Thread Ross Smith
 Of course, you'll need some settings for this so it's not annoying if
 people don't want to use it.  A simple tick box on that pop up dialog
 allowing people to say don't ask me again would probably do.

 I would like something better than that.  Don't ask me again sucks
 when much, much later you want to be asked and you don't know how to get the 
 system to ask you.

Only if your UI design doesn't make it easy to discover how to add
devices another way, or turn this setting back on.

My thinking is that this actually won't be the primary way of adding
devices.  It's simply there for ease of use for end users, as an easy
way for them to discover that they can use external drives to backup
their system.

Once you have a backup drive configured, most of the time you're not
going to want to be prompted for other devices.  Users will generally
setup a single external drive for backups, and won't want prompting
every time they insert a USB thumb drive, a digital camera, phone,
etc.

So you need that initial prompt to make the feature discoverable, and
then an easy and obvious way to configure backup devices later.

 You'd then need a second way to assign drives if the user changed
 their mind.  I'm thinking this would be to load the software and
 select a drive.  Mapping to physical slots would be tricky, I think
 you'd be better with a simple view that simply names the type of
 interface, the drive size, and shows any current disk labels.  It
 would be relatively easy then to recognise the 80GB USB drive you've
 just connected.

 Right, so do as I suggested: tell the user to remove the device if it's
 plugged in, then plug it in again.  That way you can known unambiguously
 (unless the user is doing this with more than one device at a time).

That's horrible from a user's point of view though.  Possibly worth
having as a last resort, but I'd rather just let the user pick the
device.  This does have potential as a help me find my device
feature though.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Using zfs mirror as a simple backup mechanism for time-slider.

2008-12-18 Thread Ross Smith
 I was thinking more something like:

  - find all disk devices and slices that have ZFS pools on them
  - show users the devices and pool names (and UUIDs and device paths in
  case of conflicts)..


 I was thinking that device and pool names are too variable; you need to
 be reading serial numbers or IDs from the device and link to that.


 Device names are, but there's no harm in showing them if there's
 something else that's less variable.  Pool names are not very variable
 at all.


 I was thinking of something a little different.  Don't worry about
 devices, because you don't send to a device (rather, send to a pool).
 So a simple list of source file systems and a list of destinations
 would do.  I suppose you could work up something with pictures
 and arrows, like Nautilus, but that might just be more confusing
 than useful.

True, but if this is an end user service, you want something that can
create the filesystem for them on their devices.  An advanced mode
that lets you pick any destination filesystem would be good for
network admins, but end users are just going to want to point
this at their USB drive.

 But that is the easy part.  The hard part is dealing with the plethora
 of failure modes...
 -- richard

Heh, my response to this is who cares? :-D

This is a high level service, it's purely concerned with backup
succeeded or backup failed, possibly with an overdue for backup
prompt if you want to help the user manage the backups.

Any other failure modes can be dealt with by the lower level services
or by the user.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Split responsibility for data with ZFS

2008-12-15 Thread Ross Smith
Forgive me for not understanding the details, but couldn't you also
work backwards through the blocks with ZFS and attempt to recreate the
uberblock?

So if you lost the uberblock, could you (memory and time allowing)
start scanning the disk, looking for orphan blocks that aren't
referenced anywhere else and piece together the top of the tree?

Or roll back to a previous uberblock (or a snapshot uberblock), and
then look to see what blocks are on the disk but not referenced
anywhere.  Is there any way to intelligently work out where those
blocks would be linked by looking at how they interact with the known
data?

Of course, rolling back to a previous uberblock would still be a
massive step forward, and something I think would do much to improve
the perception of ZFS as a tool to reliably store data.

You cannot overstate the difference to the end user between a file
system that on boot says:
Sorry, can't read your data pool.

With one that says:
Whoops, the uberblock, and all the backups are borked.  Would you
like to roll back to a backup uberblock, or leave the filesystem
offline to repair manually?

As much as anything else, a simple statement explaining *why* a pool
is inaccessible, and saying just how badly things have gone wrong
helps tons.  Being able to recover anything after that is just the
icing on the cake, especially if it can be done automatically.

Ross

PS.  Sorry for the duplicate Casper, I forgot to cc the list.



On Mon, Dec 15, 2008 at 10:30 AM,  casper@sun.com wrote:

I think the problem for me is not that there's a risk of data loss if
a pool becomes corrupt, but that there are no recovery tools
available.  With UFS, people expect that if the worst happens, fsck
will be able to recover their data in most cases.

 Except, of course, that fsck lies.  It fixes the metadata and the
 quality of the rest is unknown.

 Anyone using UFS knows that UFS file corruptions are common; specifically,
 when using a UFS root and the system panics while trying to
 install a device driver, there's a good chance that some files in
 /etc are corrupt. Some were application problems (some code used
 fsync(fileno(fp)); fclose(fp); it doesn't guarantee anything)


With ZFS you have no such tools, yet Victor has on at least two occasions
shown that it's quite possible to recover pools that were completely unusable
(I believe by making use of old / backup copies of the uberblock).

 True; and certainly ZFS should be able to backtrack.  But it's
 much more likely to happen automatically than via a recovery
 tool.

 See, fsck could only be written because specific corruptions, and the
 patterns they take, are known.   With ZFS, you can only fall back to
 a certain uberblock and the pattern will be a surprise.

 Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Split responsibility for data with ZFS

2008-12-15 Thread Ross Smith
I'm not sure I follow how that can happen; I thought ZFS writes were
designed to be atomic?  They either commit properly on disk or they
don't?


On Mon, Dec 15, 2008 at 6:34 PM, Bob Friesenhahn
bfrie...@simple.dallas.tx.us wrote:
 On Mon, 15 Dec 2008, Ross wrote:

 My concern is that ZFS has all this information on disk, it has the
 ability to know exactly what is and isn't corrupted, and it should (at least
 for a system with snapshots) have many, many potential uberblocks to try.
  It should be far, far better than UFS at recovering from these things, but
 for a certain class of faults, when it hits a problem it just stops dead.

 While ZFS knows if a data block is retrieved correctly from disk, a
 correctly retrieved data block does not indicate that the pool isn't
 corrupted.  A block written in the wrong order is a form of corruption.

 Bob
 ==
 Bob Friesenhahn
 bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
 GraphicsMagick Maintainer,http://www.GraphicsMagick.org/


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] cannot mount ZFS volume

2008-12-11 Thread John Smith
Ahhh...I missed the difference between a volume and a FS. That was it...thanks.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] cannot mount ZFS volume

2008-12-10 Thread John Smith
When I create a volume I am unable to mount it locally. I'm pretty sure it has 
something to do with the other volumes in the same ZFS pool being shared out as 
iSCSI LUNs.  For some reason ZFS thinks the base volume is iSCSI. Is there a 
flag that I am missing? Thanks in advance for the help.

[EMAIL PROTECTED]:~# zpool list
NAME   SIZE   USED  AVAILCAP  HEALTH  ALTROOT
datapool   464G   196G   268G42%  ONLINE  -
rpool 48.8G  4.33G  44.4G 8%  ONLINE  -

[EMAIL PROTECTED]:~# zfs create -V 2g datapool/share
   
[EMAIL PROTECTED]:~# zfs list
NAME USED  AVAIL  REFER  MOUNTPOINT
datapool 352G   105G18K  /datapool
datapool/backup  200G   207G  97.7G  -
datapool/datavol 150G   156G  98.3G  -
datapool/share 2G   107G16K  -

[EMAIL PROTECTED]:~# zfs mount datapool/share
cannot open 'datapool/share': operation not applicable to datasets of this type

[EMAIL PROTECTED]:~# zfs share datapool/share
cannot share 'datapool/share': 'shareiscsi' property not set
set 'shareiscsi' property or use iscsitadm(1M) to share this volume

[EMAIL PROTECTED]:~# zfs get shareiscsi datapool
NAME  PROPERTYVALUE   SOURCE
datapool  shareiscsi  off local

[EMAIL PROTECTED]:~# zfs get shareiscsi datapool/share
NAMEPROPERTYVALUE   SOURCE
datapool/share  shareiscsi  off inherited from datapool

[EMAIL PROTECTED]:~# zfs set sharenfs=on datapool/share
cannot set property for 'datapool/share': 'sharenfs' does not apply to datasets 
of this type
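
For the archives, the fix that came out of this thread: drop the -V so a
file system is created instead of a volume, since volumes are block devices
and have no mountpoint or sharenfs of their own.  A sketch:

# zfs create datapool/share
# zfs set sharenfs=on datapool/share
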
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs not yet suitable for HA applications?

2008-12-05 Thread Ross Smith
Hi Dan, replying in line:

On Fri, Dec 5, 2008 at 9:19 PM, David Anderson [EMAIL PROTECTED] wrote:
 Trying to keep this in the spotlight. Apologies for the lengthy post.

Heh, don't apologise, you should see some of my posts... o_0

 I'd really like to see features as described by Ross in his summary of the
 Availability: ZFS needs to handle disk removal / driver failure better
  (http://www.opensolaris.org/jive/thread.jspa?messageID=274031#274031 ).
 I'd like to have these/similar features as well. Has there already been
 internal discussions regarding adding this type of functionality to ZFS
 itself, and was there approval, disapproval or no decision?

 Unfortunately my situation has put me in urgent need to find workarounds in
 the meantime.

 My setup: I have two iSCSI target nodes, each with six drives exported via
 iscsi (Storage Nodes). I have a ZFS Node that logs into each target from
 both Storage Nodes and creates a mirrored Zpool with one drive from each
 Storage Node comprising each half of the mirrored vdevs (6 x 2-way mirrors).

 My problem: If a Storage Node crashes completely, is disconnected from the
 network, iscsitgt core dumps, a drive is pulled, or a drive has a problem
 accessing data (read retries), then my ZFS Node hangs while ZFS waits
 patiently for the layers below to report a problem and timeout the devices.
 This can lead to a roughly 3 minute or longer halt when reading OR writing
 to the Zpool on the ZFS node. While this is acceptable in certain
 situations, I have a case where my availability demand is more severe.

 My goal: figure out how to have the zpool pause for NO LONGER than 30
 seconds (roughly within a typical HTTP request timeout) and then issue
 reads/writes to the good devices in the zpool/mirrors while the other side
 comes back online or is fixed.

 My ideas:
  1. In the case of the iscsi targets disappearing (iscsitgt core dump,
 Storage Node crash, Storage Node disconnected from network), I need to lower
 the iSCSI login retry/timeout values. Am I correct in assuming the only way
 to accomplish this is to recompile the iscsi initiator? If so, can someone
 help point me in the right direction (I have never compiled ONNV sources -
 do I need to do this or can I just recompile the iscsi initiator)?

I believe it's possible to just recompile the initiator and install
the new driver.  I have some *very* rough notes that were sent to me
about a year ago, but I've no experience compiling anything in
Solaris, so don't know how useful they will be.  I'll try to dig them
out in case they're useful.


   1.a. I'm not sure in what Initiator session states iscsi_sess_max_delay is
 applicable - only for the initial login, or also in the case of reconnect?
 Ross, if you still have your test boxes available, can you please try
 setting set iscsi:iscsi_sess_max_delay = 5 in /etc/system, reboot and try
 failing your iscsi vdevs again? I can't find a case where this was tested
 quick failover.

Will gladly have a go at this on Monday.
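
For reference, the knobs being discussed look like this (names taken from
this thread -- treat them as assumptions and verify against your build
before relying on them):

set iscsi:iscsi_sess_max_delay = 5
(in /etc/system, followed by a reboot)

# echo "iscsi_sess_max_delay/D" | mdb -k
(reads the current value from the running kernel)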

1.b. I would much prefer to have bug 649 addressed and fixed rather
 than having to resort to recompiling the iscsi initiator (if
 iscsi_sess_max_delay) doesn't work. This seems like a trivial feature to
 implement. How can I sponsor development?

  2. In the case of the iscsi target being reachable, but the physical disk
 is having problems reading/writing data (retryable events that take roughly
 60 seconds to timeout), should I change the iscsi_rx_max_window tunable with
 mdb? Is there a tunable for iscsi_tx? Ross, I know you tried this recently
 in the thread referenced above (with value 15), which resulted in a 60
 second hang. How did you offline the iscsi vol to test this failure? Unless
 iscsi uses a multiple of the value for retries, then maybe the way you
 failed the disk caused the iscsi system to follow a different failure path?
 Unfortunately I don't know of a way to introduce read/write retries to a
 disk while the disk is still reachable and presented via iscsitgt, so I'm
 not sure how to test this.

So far I've just been shutting down the Solaris box hosting the iSCSI
target.  Next step will involve pulling some virtual cables.
Unfortunately I don't think I've got a physical box handy to test
drive failures right now, but my previous testing (of simply pulling
drives) showed that it can be hit and miss as to how well ZFS detects
these types of 'failure'.

Like you I don't know yet how to simulate failures, so I'm doing
simple tests right now, offlining entire drives or computers.
Unfortunately I've found more than enough problems with just those
tests to keep me busy.


2.a With the fix of
 http://bugs.opensolaris.org/view_bug.do?bug_id=6518995 , we can set
 sd_retry_count along with sd_io_time to cause I/O failure when a command
 takes longer than sd_retry_count * sd_io_time. Can (or should) these
 tunables be set on the imported iscsi disks in the ZFS Node, or can/should
 they be applied only to the local disk on 

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-12-03 Thread Ross Smith
Yeah, thanks Maurice, I just saw that one this afternoon.  I guess you
can't reboot with iscsi full stop... o_0

And I've seen the iscsi bug before (I was just too lazy to look it up
lol), I've been complaining about that since February.

In fact it's been a bad week for iscsi here, I've managed to crash the
iscsi client twice in the last couple of days too (full kernel dump
crashes), so I'll be filing a bug report on that tomorrow morning when
I get back to the office.

Ross


On Wed, Dec 3, 2008 at 7:39 PM, Maurice Volaski [EMAIL PROTECTED] wrote:
 2.  With iscsi, you can't reboot with sendtargets enabled, static
 discovery still seems to be the order of the day.

 I'm seeing this problem with static discovery:
 http://bugs.opensolaris.org/view_bug.do?bug_id=6775008.

 4.  iSCSI still has a 3 minute timeout, during which time your pool will
 hang, no matter how many redundant drives you have available.

 This is CR 649, http://bugs.opensolaris.org/view_bug.do?bug_id=649,
 which is separate from the boot time timeout, though, and also one that Sun
 so far has been unable to fix!
 --

 Maurice Volaski, [EMAIL PROTECTED]
 Computing Support, Rose F. Kennedy Center
 Albert Einstein College of Medicine of Yeshiva University

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-12-02 Thread Ross Smith
Hey folks,

I've just followed up on this, testing iSCSI with a raided pool, and
it still appears to be struggling when a device goes offline.

 I don't see how this could work except for mirrored pools.  Would that
 carry enough market to be worthwhile?
 -- richard


 I have to admit, I've not tested this with a raided pool, but since
 all ZFS commands hung when my iSCSI device went offline, I assumed
 that you would get the same effect of the pool hanging if a raid-z2
 pool is waiting for a response from a device.  Mirrored pools do work
 particularly well with this since it gives you the potential to have
 remote mirrors of your data, but if you had a raid-z2 pool, you still
 wouldn't want that hanging if a single device failed.


 zpool commands hanging is CR6667208, and has been fixed in b100.
 http://bugs.opensolaris.org/view_bug.do?bug_id=6667208

 I will go and test the raid scenario though on a current build, just to be
 sure.


 Please.
 -- richard


I've just created a pool using three snv_103 iscsi Targets, with a
fourth install of snv_103 collating those targets into a raidz pool,
and sharing that out over CIFS.
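(For anyone wanting to build a similar rig, this is the rough shape of it - all
names, sizes and addresses are made up, and it assumes the iscsitgt and CIFS
services are already enabled:

On each target box:
# zfs create -V 20g tank/iscsivol
# zfs set shareiscsi=on tank/iscsivol

On the head node:
# iscsiadm add discovery-address 192.168.0.101:3260
# iscsiadm modify discovery --sendtargets enable
# zpool create iscsipool raidz <dev1> <dev2> <dev3>
# zfs create iscsipool/share
# zfs set sharesmb=on iscsipool/share
)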

To test the server, while transferring files from a windows
workstation, I powered down one of the three iSCSI targets.  It took a
few minutes to shutdown, but once that happened the windows copy
halted with the error:
The specified network name is no longer available.

At this point, the zfs admin tools still work fine (which is a huge
improvement, well done!), but zpool status still reports that all
three devices are online.

A minute later, I can open the share again, and start another copy.

Thirty seconds after that, zpool status finally reports that the iscsi
device is offline.

So it looks like we have the same problems with that 3 minute delay,
with zpool status reporting wrong information, and the CIFS service
having problems too.

At this point I restarted the iSCSI target, but had problems bringing
it back online.  It appears there's a bug in the initiator, but it's
easily worked around:
http://www.opensolaris.org/jive/thread.jspa?messageID=312981#312981

What was great was that as soon as the iSCSI initiator reconnected,
ZFS started resilvering.

What might not be so great is the fact that all three devices are
showing that they've been resilvered:

# zpool status
  pool: iscsipool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: resilver completed after 0h2m with 0 errors on Tue Dec  2 11:04:10 2008
config:

NAME                                   STATE   READ WRITE CKSUM
iscsipool                              ONLINE     0     0     0
  raidz1                               ONLINE     0     0     0
    c2t600144F04933FF6C5056967AC800d0  ONLINE     0     0     0  179K resilvered
    c2t600144F04934FAB35056964D9500d0  ONLINE     5 9.88K     0  311M resilvered
    c2t600144F04934119E50569675FF00d0  ONLINE     0     0     0  179K resilvered

errors: No known data errors

It's proving a little hard to know exactly what's happening when,
since I've only got a few seconds to log times, and there are delays
with each step.  However, I ran another test using robocopy and was
able to observe the behaviour a little more closely:

Test 2:  Using robocopy for the transfer, and iostat plus zpool status
on the server

10:46:30 - iSCSI server shutdown started
10:52:20 - all drives still online according to zpool status
10:53:30 - robocopy error - The specified network name is no longer available
 - zpool status shows all three drives as online
 - zpool iostat appears to have hung, taking much longer than the 30s
specified to return a result
 - robocopy is now retrying the file, but appears to have hung
10:54:30 - robocopy, CIFS and iostat all start working again, pretty
much simultaneously
 - zpool status now shows the drive as offline

I could probably do with using DTrace to get a better look at this,
but I haven't learnt that yet.  My guess as to what's happening would
be:

- iSCSI target goes offline
- ZFS will not be notified for 3 minutes, but I/O to that device is
essentially hung
- CIFS times out (I suspect this is on the client side with around a
30s timeout, but I can't find the timeout documented anywhere).
- zpool iostat is now waiting, I may be wrong but this doesn't appear
to have benefited from the changes to zpool status
- After 3 minutes, the iSCSI drive goes offline.  The pool carries on
with the remaining two drives, CIFS carries on working, iostat carries
on working.  zpool status however is still out of date.
- zpool status eventually catches up, and reports that the drive has
gone 

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-12-02 Thread Ross Smith
Hi Richard,

Thanks, I'll give that a try.  I think I just had a kernel dump while
trying to boot this system back up though, I don't think it likes it
if the iscsi targets aren't available during boot.  Again, that rings
a bell, so I'll go see if that's another known bug.

Changing that setting on the fly didn't seem to help, if anything
things are worse this time around.  I changed the timeout to 15
seconds, but didn't restart any services:

# echo iscsi_rx_max_window/D | mdb -k
iscsi_rx_max_window:
iscsi_rx_max_window:180
# echo iscsi_rx_max_window/W0t15 | mdb -kw
iscsi_rx_max_window:0xb4=   0xf
# echo iscsi_rx_max_window/D | mdb -k
iscsi_rx_max_window:
iscsi_rx_max_window:15

After making those changes, and repeating the test, offlining an iscsi
volume hung all the commands running on the pool.  I had three ssh
sessions open, running the following:
# zpool iostat -v iscsipool 10 100
# format </dev/null
# time zpool status

They hung for what felt like a minute or so.
After that, the CIFS copy timed out.

After the CIFS copy timed out, I tried immediately restarting it.  It
took a few more seconds, but restarted no problem.  Within a few
seconds of that restarting, iostat recovered, and format returned it's
result too.

Around 30 seconds later, zpool status reported two drives, paused
again, then showed the status of the third:

# time zpool status
  pool: iscsipool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: resilver completed after 0h0m with 0 errors on Tue Dec  2 16:39:21 2008
config:

NAME                                   STATE   READ WRITE CKSUM
iscsipool                              ONLINE     0     0     0
  raidz1                               ONLINE     0     0     0
    c2t600144F04933FF6C5056967AC800d0  ONLINE     0     0     0  15K resilvered
    c2t600144F04934FAB35056964D9500d0  ONLINE     0     0     0  15K resilvered
    c2t600144F04934119E50569675FF00d0  ONLINE     0   200     0  24K resilvered

errors: No known data errors

real3m51.774s
user0m0.015s
sys 0m0.100s

Repeating that a few seconds later gives:

# time zpool status
  pool: iscsipool
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-2Q
 scrub: resilver completed after 0h0m with 0 errors on Tue Dec  2 16:39:21 2008
config:

NAME                                   STATE     READ WRITE CKSUM
iscsipool                              DEGRADED     0     0     0
  raidz1                               DEGRADED     0     0     0
    c2t600144F04933FF6C5056967AC800d0  ONLINE       0     0     0  15K resilvered
    c2t600144F04934FAB35056964D9500d0  ONLINE       0     0     0  15K resilvered
    c2t600144F04934119E50569675FF00d0  UNAVAIL      3 5.80K     0  cannot open

errors: No known data errors

real0m0.272s
user0m0.029s
sys 0m0.169s




On Tue, Dec 2, 2008 at 3:58 PM, Richard Elling [EMAIL PROTECTED] wrote:

..

 iSCSI timeout is set to 180 seconds in the client code.  The only way
 to change is to recompile it, or use mdb.  Since you have this test rig
 setup, and I don't, do you want to experiment with this timeout?
 The variable is actually called iscsi_rx_max_window so if you do
   echo iscsi_rx_max_window/D | mdb -k
 you should see 180
 Change it using something like:
   echo iscsi_rx_max_window/W0t30 | mdb -kw
 to set it to 30 seconds.
 -- richard
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-11-27 Thread Ross Smith
On Fri, Nov 28, 2008 at 5:05 AM, Richard Elling [EMAIL PROTECTED] wrote:
 Ross wrote:

 Well, you're not alone in wanting to use ZFS and iSCSI like that, and in
 fact my change request suggested that this is exactly one of the things that
 could be addressed:

 The idea is really a two stage RFE, since just the first part would have
 benefits.  The key is to improve ZFS availability, without affecting it's
 flexibility, bringing it on par with traditional raid controllers.

 A.  Track response times, allowing for lop sided mirrors, and better
 failure detection.

 I've never seen a study which shows, categorically, that disk or network
 failures are preceded by significant latency changes.  How do we get
 better failure detection from such measurements?

Not preceded by as such, but a disk or network failure will certainly
cause significant latency changes.  If the hardware is down, there's
going to be a sudden, and very large change in latency.  Sure, FMA
will catch most cases, but we've already shown that there are some
cases where it doesn't work too well (and I would argue that's always
going to be possible when you are relying on so many different types
of driver).  This is there to ensure that ZFS can handle *all* cases.


  Many people have requested this since it would facilitate remote live
 mirrors.


 At a minimum, something like VxVM's preferred plex should be reasonably
 easy to implement.

 B.  Use response times to timeout devices, dropping them to an interim
 failure mode while waiting for the official result from the driver.  This
 would prevent redundant pools hanging when waiting for a single device.


 I don't see how this could work except for mirrored pools.  Would that
 carry enough market to be worthwhile?
 -- richard

I have to admit, I've not tested this with a raided pool, but since
all ZFS commands hung when my iSCSI device went offline, I assumed
that you would get the same effect of the pool hanging if a raid-z2
pool is waiting for a response from a device.  Mirrored pools do work
particularly well with this since it gives you the potential to have
remote mirrors of your data, but if you had a raid-z2 pool, you still
wouldn't want that hanging if a single device failed.

I will go and test the raid scenario though on a current build, just to be sure.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS, Smashing Baby a fake???

2008-11-25 Thread Ross Smith
Hey Jeff,

Good to hear there's work going on to address this.

What did you guys think to my idea of ZFS supporting a waiting for a
response status for disks as an interim solution that allows the pool
to continue operation while it's waiting for FMA or the driver to
fault the drive?

I do appreciate that it's hard to come up with a definitive "it's dead,
Jim" answer, and I agree that long term the FMA approach will pay
dividends.  But I still feel this is a good short term solution, and
one that would also complement your long term plans.

My justification for this is that it seems to me that you can split
disk behavior into two states:
- returns data ok
- doesn't return data ok

And for the state where it's not returning data, you can again split
that in two:
- returns wrong data
- doesn't return data

The first of these is already covered by ZFS with its checksums (with
FMA doing the extra work to fault drives), so it's just the second
that needs immediate attention, and for the life of me I can't think
of any situation that a simple timeout wouldn't catch.

Personally I'd love to see two parameters, allowing this behavior to
be turned on if desired, and allowing timeouts to be configured:

zfs-auto-device-timeout
zfs-auto-device-timeout-fail-delay

The first sets whether to use this feature, and configures the maximum
time ZFS will wait for a response from a device before putting it in a
waiting status.  The second would be optional and is the maximum
time ZFS will wait before faulting a device (at which point it's
replaced by a hot spare).

The reason I think this will work well with the FMA work is that you
can implement this now and have a real improvement in ZFS
availability.  Then, as the other work starts bringing better modeling
for drive timeouts, the parameters can be either removed, or set
automatically by ZFS.

Long term I guess there's also the potential to remove the second
setting if you felt FMA etc ever got reliable enough, but personally I
would always want to have the final fail delay set.  I'd maybe set it
to a long value such as 1-2 minutes to give FMA, etc a fair chance to
find the fault.  But I'd be much happier knowing that the system will
*always* be able to replace a faulty device within a minute or two, no
matter what the FMA system finds.

The key thing is that you're not faulting devices early, so FMA is
still vital.  The idea is purely to let ZFS to keep the pool active by
removing the need for the entire pool to wait on the FMA diagnosis.

As I said before, the driver and firmware are only aware of a single
disk, and I would imagine that FMA also has the same limitation - it's
only going to be looking at a single item and trying to determine
whether it's faulty or not.  Because of that, FMA is going to be
designed to be very careful to avoid false positives, and will likely
take it's time to reach an answer in some situations.

ZFS however has the benefit of knowing more about the pool, and in the
vast majority of situations, it should be possible for ZFS to read or
write from other devices while it's waiting for an 'official' result
from any one faulty component.

Ross


On Tue, Nov 25, 2008 at 8:37 AM, Jeff Bonwick [EMAIL PROTECTED] wrote:
 I think we (the ZFS team) all generally agree with you.  The current
 nevada code is much better at handling device failures than it was
 just a few months ago.  And there are additional changes that were
 made for the FishWorks (a.k.a. Amber Road, a.k.a. Sun Storage 7000)
 product line that will make things even better once the FishWorks team
 has a chance to catch its breath and integrate those changes into nevada.
 And then we've got further improvements in the pipeline.

 The reason this is all so much harder than it sounds is that we're
 trying to provide increasingly optimal behavior given a collection of
 devices whose failure modes are largely ill-defined.  (Is the disk
 dead or just slow?  Gone or just temporarily disconnected?  Does this
 burst of bad sectors indicate catastrophic failure, or just localized
 media errors?)  The disks' SMART data is notoriously unreliable, BTW.
 So there's a lot of work underway to model the physical topology of
 the hardware, gather telemetry from the devices, the enclosures,
 the environmental sensors etc, so that we can generate an accurate
 FMA fault diagnosis and then tell ZFS to take appropriate action.

 We have some of this today; it's just a lot of work to complete it.

 Oh, and regarding the original post -- as several readers correctly
 surmised, we weren't faking anything, we just didn't want to wait
 for all the device timeouts.  Because the disks were on USB, which
 is a hotplug-capable bus, unplugging the dead disk generated an
 interrupt that bypassed the timeout.  We could have waited it out,
 but 60 seconds is an eternity on stage.

 Jeff

 On Mon, Nov 24, 2008 at 10:45:18PM -0800, Ross wrote:
 But that's exactly the problem Richard:  AFAIK.

 Can you state that absolutely, 

Re: [zfs-discuss] ZFS, Smashing Baby a fake???

2008-11-25 Thread Ross Smith
PS.  I think this also gives you a chance at making the whole problem
much simpler.  Instead of the hard question of "is this faulty?",
you're just trying to say "is it working right now?".

In fact, I'm now wondering if the "waiting for a response" flag
wouldn't be better as "possibly faulty".  That way you could use it
with checksum errors too, possibly with settings as simple as errors
per minute or error percentage.  As with the timeouts, you could
have it off by default (or provide sensible defaults), and let
administrators tweak it for their particular needs.

Imagine a pool with the following settings:
- zfs-auto-device-timeout = 5s
- zfs-auto-device-checksum-fail-limit-epm = 20
- zfs-auto-device-checksum-fail-limit-percent = 10
- zfs-auto-device-fail-delay = 120s

That would allow the pool to flag a device as possibly faulty
regardless of the type of fault, and take immediate proactive action
to safeguard data (generally long before the device is actually
faulted).

A device triggering any of these flags would be enough for ZFS to
start reading from (or writing to) other devices first, and should you
get multiple failures, or problems on a non redundant pool, you always
just revert back to ZFS' current behaviour.

Ross





On Tue, Nov 25, 2008 at 8:37 AM, Jeff Bonwick [EMAIL PROTECTED] wrote:
 I think we (the ZFS team) all generally agree with you.  The current
 nevada code is much better at handling device failures than it was
 just a few months ago.  And there are additional changes that were
 made for the FishWorks (a.k.a. Amber Road, a.k.a. Sun Storage 7000)
 product line that will make things even better once the FishWorks team
 has a chance to catch its breath and integrate those changes into nevada.
 And then we've got further improvements in the pipeline.

 The reason this is all so much harder than it sounds is that we're
 trying to provide increasingly optimal behavior given a collection of
 devices whose failure modes are largely ill-defined.  (Is the disk
 dead or just slow?  Gone or just temporarily disconnected?  Does this
 burst of bad sectors indicate catastrophic failure, or just localized
 media errors?)  The disks' SMART data is notoriously unreliable, BTW.
 So there's a lot of work underway to model the physical topology of
 the hardware, gather telemetry from the devices, the enclosures,
 the environmental sensors etc, so that we can generate an accurate
 FMA fault diagnosis and then tell ZFS to take appropriate action.

 We have some of this today; it's just a lot of work to complete it.

 Oh, and regarding the original post -- as several readers correctly
 surmised, we weren't faking anything, we just didn't want to wait
 for all the device timeouts.  Because the disks were on USB, which
 is a hotplug-capable bus, unplugging the dead disk generated an
 interrupt that bypassed the timeout.  We could have waited it out,
 but 60 seconds is an eternity on stage.

 Jeff

 On Mon, Nov 24, 2008 at 10:45:18PM -0800, Ross wrote:
 But that's exactly the problem Richard:  AFAIK.

 Can you state that absolutely, categorically, there is no failure mode out 
 there (caused by hardware faults, or bad drivers) that won't lock a drive up 
 for hours?  You can't, obviously, which is why we keep saying that ZFS 
 should have this kind of timeout feature.

 For once I agree with Miles, I think he's written a really good writeup of 
 the problem here.  My simple view on it would be this:

 Drives are only aware of themselves as an individual entity.  Their job is 
 to save & restore data to themselves, and drivers are written to minimise 
 any chance of data loss.  So when a drive starts to fail, it makes complete 
 sense for the driver and hardware to be very, very thorough about trying to 
 read or write that data, and to only fail as a last resort.

 I'm not at all surprised that drives take 30 seconds to timeout, nor that 
 they could slow a pool for hours.  That's their job.  They know nothing else 
 about the storage, they just have to do their level best to do as they're 
 told, and will only fail if they absolutely can't store the data.

 The raid controller on the other hand (Netapp / ZFS, etc) knows all about 
 the pool.  It knows if you have half a dozen good drives online, it knows if 
 there are hot spares available, and it *should* also know how quickly the 
 drives under its care usually respond to requests.

 ZFS is perfectly placed to spot when a drive is starting to fail, and to 
 take the appropriate action to safeguard your data.  It has far more 
 information available than a single drive ever will, and should be designed 
 accordingly.

 Expecting the firmware and drivers of individual drives to control the 
 failure modes of your redundant pool is just crazy imo.  You're throwing 
 away some of the biggest benefits of using multiple drives in the first 
 place.
 --
 This message posted from opensolaris.org
 ___
 zfs-discuss mailing list
 

Re: [zfs-discuss] ZFS, Smashing Baby a fake???

2008-11-25 Thread Ross Smith
No, I count that as doesn't return data ok, but my post wasn't very
clear at all on that.

Even for a write, the disk will return something to indicate that the
action has completed, so that can also be covered by just those two
scenarios, and right now ZFS can lock the whole pool up if it's
waiting for that response.

My idea is simply to allow the pool to continue operation while
waiting for the drive to fault, even if that's a faulty write.  It
just means that the rest of the operations (reads and writes) can keep
working for the minute (or three) it takes for FMA and the rest of the
chain to flag a device as faulty.

For write operations, the data can be safely committed to the rest of
the pool, with just the outstanding writes for the drive left waiting.
 Then as soon as the device is faulted, the hot spare can kick in, and
the outstanding writes quickly written to the spare.

For single parity, or non redundant volumes there's some benefit in
this.  For dual parity pools there's a massive benefit as your pool
stays available, and your data is still well protected.

Ross



On Tue, Nov 25, 2008 at 10:44 AM,  [EMAIL PROTECTED] wrote:


My justification for this is that it seems to me that you can split
disk behavior into two states:
- returns data ok
- doesn't return data ok


 I think you're missing "won't write".

 There's clearly a difference between "get data from a different copy",
 which you can fix by retrying data to a different part of the redundant
 data, and writing data: the data which can't be written must be kept
 until the drive is faulted.


 Casper


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS, Smashing Baby a fake???

2008-11-25 Thread Ross Smith
Hmm, true.  The idea doesn't work so well if you have a lot of writes,
so there needs to be some thought as to how you handle that.

Just thinking aloud, could the missing writes be written to the log
file on the rest of the pool?  Or temporarily stored somewhere else in
the pool?  Would it be an option to allow up to a certain amount of
writes to be cached in this way while waiting for FMA, and only
suspend writes once that cache is full?

With a large SSD slog device would it be possible to just stream all
writes to the log?  As a further enhancement, might it be possible to
commit writes to the working drives, and just leave the writes for the
bad drive(s) in the slog (potentially saving a lot of space)?

For pools without log devices, I suspect that you would probably need
the administrator to specify the behavior as I can see several options
depending on the raid level and that pool's priorities for data
availability / integrity:

Drive fault write cache settings:
default - pool waits for device, no writes occur until device or spare
comes online
slog - writes are cached to slog device until full, then pool reverts
to default behavior (could this be the default with slog devices
present?)
pool - writes are cached to the pool itself, up to a set maximum, and
are written to the device or spare as soon as possible.  This assumes
a single parity pool with the other devices available.  If the upper
limit is reached, or another device goes faulty, the pool reverts to
default behaviour.

Storing directly to the rest of the pool would probably want to be off
by default on single parity pools, but I would imagine that it could
be on by default on dual parity pools.

Would that be enough to allow writes to continue in most circumstances
while the pool waits for FMA?

Ross



On Tue, Nov 25, 2008 at 10:55 AM,  [EMAIL PROTECTED] wrote:


My idea is simply to allow the pool to continue operation while
waiting for the drive to fault, even if that's a faulty write.  It
just means that the rest of the operations (reads and writes) can keep
working for the minute (or three) it takes for FMA and the rest of the
chain to flag a device as faulty.

 Except when you're writing a lot; 3 minutes can cause a 20GB backlog
 for a single disk.

 Casper


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS, Smashing Baby a fake???

2008-11-25 Thread Ross Smith
 The shortcomings of timeouts have been discussed on this list before. How do
 you tell the difference between a drive that is dead and a path that is just
 highly loaded?

A path that is dead is either returning bad data, or isn't returning
anything.  A highly loaded path is by definition reading & writing
lots of data.  I think you're assuming that these are file level
timeouts, when this would actually need to be much lower level.


 Sounds good - devil, meet details, etc.

Yup, I imagine there are going to be a few details to iron out, many
of which will need looking at by somebody a lot more technical than
myself.

Despite that I still think this is a discussion worth having.  So far
I don't think I've seen any situation where this would make things
worse than they are now, and I can think of plenty of cases where it
would be a huge improvement.

Of course, it also probably means a huge amount of work to implement.
I'm just hoping that it's not prohibitively difficult, and that the
ZFS team see the benefits as being worth it.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS, Smashing Baby a fake???

2008-11-25 Thread Ross Smith
I disagree Bob, I think this is a very different function to that
which FMA provides.

As far as I know, FMA doesn't have access to the big picture of pool
configuration that ZFS has, so why shouldn't ZFS use that information
to increase the reliability of the pool while still using FMA to
handle device failures?

The flip side of the argument is that ZFS already checks the data
returned by the hardware.  You might as well say that FMA should deal
with that too since it's responsible for all hardware failures.

The role of ZFS is to manage the pool; availability should be part and
parcel of that.


On Tue, Nov 25, 2008 at 3:57 PM, Bob Friesenhahn
[EMAIL PROTECTED] wrote:
 On Tue, 25 Nov 2008, Ross Smith wrote:

 Good to hear there's work going on to address this.

 What did you guys think to my idea of ZFS supporting a waiting for a
 response status for disks as an interim solution that allows the pool
 to continue operation while it's waiting for FMA or the driver to
 fault the drive?

 A stable and sane system never comes with two brains.  It is wrong to put
 this sort of logic into ZFS when ZFS is already depending on FMA to make the
 decisions and Solaris already has an infrastructure to handle faults.  The
 more appropriate solution is that this feature should be in FMA.

 Bob
 ==
 Bob Friesenhahn
 [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
 GraphicsMagick Maintainer, http://www.GraphicsMagick.org/


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Help recovering zfs filesystem

2008-11-07 Thread Nigel Smith
FYI, here are the link to the 'labelfix' utility.
It's an attachment to one of Jeff Bonwick's posts on this thread:

http://www.opensolaris.org/jive/thread.jspa?messageID=229969

or here:

http://mail.opensolaris.org/pipermail/zfs-discuss/2008-May/047267.html
http://mail.opensolaris.org/pipermail/zfs-discuss/2008-May/047270.html

Regards
Nigel Smith
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] questions on zfs send,receive,backups

2008-11-03 Thread Ross Smith
 Snapshots are not replacements for traditional backup/restore features.
 If you need the latter, use what is currently available on the market.
 -- richard

I'd actually say snapshots do a better job in some circumstances.
Certainly they're being used that way by the desktop team:
http://blogs.sun.com/erwann/entry/zfs_on_the_desktop_zfs

None of this is stuff I'm after personally btw.  This was just my
attempt to interpret the request of the OP.

Although having said that, the ability to restore single files as fast
as you can restore a whole snapshot would be a nice feature.  Is that
something that would be possible?

Say you had a ZFS filesystem containing a 20GB file, with a recent
snapshot.  Is it technically feasible to restore that file by itself
in the same way a whole filesystem is rolled back with zfs rollback?
If the file still existed, would this be a case of redirecting the
file's top level block (dnode?) to the one from the snapshot?  If the
file had been deleted, could you just copy that one block?

Is it that simple, or is there a level of interaction between files
and snapshots that I've missed (I've glanced through the tech specs,
but I'm a long way from fully understanding them).
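(For reference, the copy-based way of getting a single file back today goes via
the snapshot's .zfs directory - the paths here are made up:

# zfs set snapdir=visible tank/vms          (optional, it works even when hidden)
# ls /tank/vms/.zfs/snapshot/
# cp /tank/vms/.zfs/snapshot/monday/disk0.vmdk /tank/vms/disk0.vmdk

which of course copies every block again, and that copy cost is exactly what the
question above is trying to avoid.)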

Ross
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] questions on zfs send,receive,backups

2008-11-03 Thread Ross Smith
 If the file still existed, would this be a case of redirecting the
 file's top level block (dnode?) to the one from the snapshot?  If the
 file had been deleted, could you just copy that one block?

 Is it that simple, or is there a level of interaction between files
 and snapshots that I've missed (I've glanced through the tech specs,
 but I'm a long way from fully understanding them).


 It is as simple as a cp, or drag-n-drop in Nautilus.  The snapshot is
 read-only, so
 there is no need to cp, as long as you don't want to modify it or destroy
 the snapshot.
 -- richard

But that's missing the point here, which was that we want to restore
this file without having to copy the entire thing back.

Doing a cp or a drag-n-drop creates a new copy of the file, taking
time to restore, and allocating extra blocks.  Not a problem for small
files, but not ideal if you're say using ZFS to store virtual
machines, and want to roll back a single 20GB file from a 400GB
filesystem.

My question was whether it's technically feasible to roll back a
single file using the approach used for restoring snapshots, making it
an almost instantaneous operation?

ie:  If a snapshot exists that contains the file you want, you know
that all the relevant blocks are already on disk.  You don't want to
copy all of the blocks, but since ZFS follows a tree structure,
couldn't you restore the file by just restoring the one master block
for that file?

I'm just thinking that if it's technically feasible, I might raise an
RFE for this.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] questions on zfs send,receive,backups

2008-11-03 Thread Ross Smith
Hi Darren,

That's storing a dump of a snapshot on external media, but files
within it are not directly accessible.  The work Tim et all are doing
is actually putting a live ZFS filesystem on external media and
sending snapshots to it.

A live ZFS filesystem is far more useful (and reliable) than a dump,
and having the ability to restore individual files from that would be
even better.

It still doesn't help the OP, but I think that's what he was after.
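(A minimal sketch of that, with pool and dataset names made up - create a pool on
the external disk once, then send snapshots to it, incrementally after the first
full one:

# zpool create extbackup c5t0d0
# zfs snapshot tank/data@mon
# zfs send tank/data@mon | zfs receive extbackup/data
  ...later...
# zfs snapshot tank/data@tue
# zfs send -i mon tank/data@tue | zfs receive extbackup/data

The result is a browsable, scrubbable filesystem on the external disk rather than
an opaque dump.)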

Ross



On Mon, Nov 3, 2008 at 9:55 AM, Darren J Moffat [EMAIL PROTECTED] wrote:
 Ross wrote:

 Ok, I see where you're coming from now, but what you're talking about
 isn't zfs send / receive.  If I'm interpreting correctly, you're talking
 about a couple of features, neither of which is in ZFS yet, and I'd need the
 input of more technical people to know if they are possible.

 1.  The ability to restore individual files from a snapshot, in the same
 way an entire snapshot is restored - simply using the blocks that are
 already stored.

 2.  The ability to store (and restore from) snapshots on external media.

 What makes you say this doesn't work ?  Exactly what do you mean here
 because this will work:

$ zfs send [EMAIL PROTECTED] | dd of=/dev/tape

 Sure it might not be useful and I don't think that is what you mean here, so
 can you expand on "store snapshots on external media".

 --
 Darren J Moffat

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] diagnosing read performance problem

2008-10-30 Thread Nigel Smith
Hi Matt
Well this time you have filtered out any SSH traffic on port 22 successfully.

But I'm still only seeing half of the conversation!
I see packets sent from client to server.
That is from source: 10.194.217.12 to destination: 10.194.217.3
So a different client IP this time

And the Duplicate ACK packets (often long bursts) are back in this capture.
I've looked at these a little bit more carefully this time,
and I now notice it's using the 'TCP selective acknowledgement' feature (SACK) 
on those packets.

Now this is not something I've come across before, so I need to do some
googling!  SACK is defined in RFC 2018.

 http://www.ietf.org/rfc/rfc2018.txt

I found this explanation of when SACK is used:

 http://thenetworkguy.typepad.com/nau/2007/10/one-of-the-most.html
 http://thenetworkguy.typepad.com/nau/2007/10/tcp-selective-a.html

This seems to indicate these 'SACK' packets are triggered as a result 
of 'lost packets', in this case, it must be the packets sent back from
your server to the client, that is during your video playback.
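(As an aside, you can see how SACK is configured on the Solaris side with ndd -
I believe the default for tcp_sack_permitted is 2, i.e. actively used:

# ndd -get /dev/tcp tcp_sack_permitted
)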

Of course I'm not seeing ANY of those packets in this capture
because there are none captured from server to client!  
I'm still not sure why you cannot seem to capture these packets!

Oh, by the way, I probably should advise you to run...

 # netstat -i

..on the OpenSolaris box, to see if any errors are being counted
on the network interface.

Are you still seeing the link going up/down in '/var/adm/messages'?
You are never going to do any good while that is happening.
I think you need to try a different network card in the server.
Regards
Nigel Smith
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] HELP! SNV_97, 98, 99 zfs with iscsitadm and VMWare!

2008-10-29 Thread Nigel Smith
Hi Tano
Great to hear that you've now got this working!!

I understand you are using a Broadcom network card,
from your previous posts I can see you are using the 'bnx' driver.

I will raise this as a bug, but first please would you run 
'/usr/X11/bin/scanpci'
to identify the exact 'vendor id' and 'device id' for the Broadcom network 
chipset,
and report that back here.

I must admit that this is the first I have heard of 'I/OAT DMA',
so I did some Googling on it, and found these links:

http://opensolaris.org/os/community/arc/caselog/2008/257/onepager/

To quote from that ARC case:

  All new Sun Intel based platforms have Intel I/OAT (I/O Acceleration
   Technology) hardware.

   The first such hardware is an on-systemboard asynchronous DMA engine
   code named Crystal Beach.

   Through a set of RFEs Solaris will use this hardware to implement
   TCP receive side zero CPU copy via a socket.

Ok, so I think that makes some sense, in the context of
the problem we were seeing. It's referring to how the network
adaptor transfers the data it has received, out of the buffer
and onto the rest of the operating system.

I've just looked to see if I can find the source code for 
the BNX driver, but I cannot find it.

Digging deeper we find on this page:
http://www.opensolaris.org/os/about/no_source/
..on the 'ON' tab, that:

Components for which there are currently no plans to release source
bnx driver (B)  Broadcom NetXtreme II Gigabit Ethernet driver

So the bnx driver is closed source :-(
Regards
Nigel Smith
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] diagnosing read performance problem

2008-10-29 Thread Nigel Smith
Hi Matt
Can you just confirm if that Ethernet capture file, that you made available,
was done on the client, or on the server. I'm beginning to suspect you
did it on the client.

You can get a capture file on the server (OpenSolaris) using the 'snoop'
command, as per one of my previous emails.  You can still view the
capture file with WireShark as it supports the 'snoop' file format.

Normally it would not be too important where the capture was obtained,
but here, where something strange is happening, it could be critical to 
understanding what is going wrong and where.

It would be interesting to do two separate captures - one on the client
and one on the server, at the same time, as this would show if the
switch was causing disruption.  Try to have the clocks on the client &
server synchronised as close as possible.
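(On the Windows side WireShark can do the capture directly, or if you prefer the
command line and have WinPcap/WinDump installed, something along these lines -
the interface number comes from 'windump -D', and the filter mirrors the snoop
one:

windump -i 1 -s 0 -w client.cap host 10.194.217.3 and not port 22
)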
Thanks
Nigel Smith
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] diagnosing read performance problem

2008-10-29 Thread Nigel Smith
Hi Matt
In your previous capture (which you have now confirmed was done
on the Windows client), all those 'Bad TCP checksum' packets sent by the
client are explained, because you must be doing hardware TCP checksum offloading
on the client network adaptor.  WireShark will capture the packets before
that hardware calculation is done, so the checksums all appear to be wrong,
as they have not yet been calculated!

  http://wiki.wireshark.org/TCP_checksum_offload
  http://www.wireshark.org/docs/wsug_html_chunked/ChAdvChecksums.html

Ok, so lets look at the new capture, 'snoop'ed on the OpenSolaris box.

I was surprised how small that snoop capture file was
 - only 753400 bytes after unzipping.
I soon realized why...

The strange thing is that I'm only seeing half of the conversation!
I see packets sent from client to server.
That is from source: 10.194.217.10 to destination: 10.194.217.3

I can also see some packets from
source: 10.194.217.5 (Your AD domain controller) to destination  10.194.217.3

But you've not capture anything transmitted from your
OpenSolaris server - source: 10.194.217.3

(I checked, and I did not have any filters applied in WireShark
that would cause the missing half!)
Strange! I'm not sure how you did that.

The half of the conversation that I can see looks fine - there
does not seem to be any problem.  I'm not seeing any duplication
of ACK's from the client in this capture.  
(So again somewhat strange, unless you've fixed the problem!)

I'm assuming you're using a single network card in the Solaris server, 
but maybe you had better just confirm that.

Regarding not capturing SSH traffic and only capturing traffic from
(& hopefully to) the client, try this:

 # snoop -o test.cap -d rtls0 host 10.194.217.10 and not port 22
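(You can sanity-check the resulting file locally before posting it, e.g.

# snoop -i test.cap | head -20

and this time you should see lines in both directions, to and from 10.194.217.3.)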

Regarding those 'link down', 'link up' messages in '/var/adm/messages':
I can tie up some of those events with your snoop capture file,
but it just shows that no packets are being received while the link is down,
which is exactly what you would expect.
But dropping the link for a second will surely disrupt your video playback!
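(A quick way to see how often that is happening, assuming the messages look like
the ones you posted earlier:

# grep -ci 'link down' /var/adm/messages
)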

If the switch is ok, and the cable from the switch is ok, then it does
now point towards the network card in the OpenSolaris box.  
Maybe as simple as a bad mechanical connection on the cable socket

BTW, just run '/usr/X11/bin/scanpci'  and identify the 'vendor id' and
'device id' for the network card, just in case it turns out to be a driver bug.
Regards
Nigel Smith
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] diagnosing read performance problem

2008-10-28 Thread Nigel Smith
Hi Matt.
Ok, got the capture and successfully 'unzipped' it.
(Sorry, I guess I'm using old software to do this!)

I see 12840 packets. The capture is a TCP conversation 
between two hosts using the SMB aka CIFS protocol.

10.194.217.10 is the client - Presumably Windows?
10.194.217.3 is the server - Presumably OpenSolaris - CIFS server?

Using WireShark,
Menu: 'Statistics  Endpoints' show:

The Client has transmitted 4849 packets, and
the Server has transmitted 7991 packets.

Menu: 'Analyze  Expert info Composite':
The 'Errors' tab shows:
4849 packets with a 'Bad TCP checksum' error - These are all transmitted by the 
Client.

(Apply a filter of 'ip.src_host == 10.194.217.10' to confirm this.)

The 'Notes' tab shows:
..numerous 'Duplicate Ack's'
For example, for 60 different ACK packets, the exact same packet was 
re-transmitted 7 times!
Packet #3718 was duplicated 17 times.
Packet #8215 was duplicated 16 times.
packet #6421 was duplicated 15 times, etc.
These bursts of duplicate ACK packets are all coming from the client side.

This certainly looks strange to me - I've not seen anything like this before.
It's not going to help the speed to unnecessarily duplicate packets like
that, and these bursts are often closely followed by a short delay, ~0.2 seconds.
And as far as I can see, it looks to point towards the client as the source
of the problem.
If you are seeing the same problem with other client PCs, then I guess we need to 
suspect the 'switch' that connects them.

Ok, that's my thoughts & conclusion for now.
Maybe you could get some more snoop captures with other clients, and
with a different switch, and do a similar analysis.
Regards
Nigel Smith
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] My 500-gig ZFS is gone: insufficient replicas, corrupted data

2008-10-27 Thread Nigel Smith
Hi Eugene
I'm delighted to hear you got your files back!

I've seen a few posts to this forum where people have
done some change to the hardware, and then found
that the ZFS pool have gone. And often you never
hear any more from them, so you assume they could
not recover it.

Thanks for reporting back your interesting story.
I wonder how many other people have been caught out
with this 'Host Protected Area' (HPA) and never
worked out that this was the cause...

Maybe one moral of this story is to make a note of
your hard drive and partition sizes now, while
you have a working system.

If your using Solaris, maybe try 'prtvtoc'.
http://docs.sun.com/app/docs/doc/819-2240/prtvtoc-1m?a=view
(Unless someone knows a better way?)
Thanks
Nigel Smith


# prtvtoc /dev/rdsk/c1t1d0
* /dev/rdsk/c1t1d0 partition map
*
* Dimensions:
* 512 bytes/sector
* 1465149168 sectors
* 1465149101 accessible sectors
*
* Flags:
*   1: unmountable
*  10: read-only
*
* Unallocated space:
*                                First     Sector        Last
*                               Sector      Count      Sector
*                                   34        222         255
*
*                          First     Sector        Last
* Partition  Tag  Flags    Sector     Count       Sector  Mount Directory
       0      4    00        256  1465132495  1465132750
       8     11    00  1465132751       16384  1465149134
--
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] My 500-gig ZFS is gone: insufficient replicas, corrupted data

2008-10-27 Thread Nigel Smith
...check out that link that Eugene provided.
It was a GigaByte GA-G31M-S2L motherboard.
http://www.gigabyte.com.tw/Products/Motherboard/Products_Spec.aspx?ProductID=2693

Some more info on 'Host Protected Area' (HPA), relating to OpenSolaris here:
http://opensolaris.org/os/community/arc/caselog/2007/660/onepager/
http://bugs.opensolaris.org/view_bug.do?bug_id=5044205

Regards
Nigel Smith
--
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool import problem

2008-10-27 Thread Nigel Smith
Hi Terry
Please could you post back to this forum the output from

 # zdb -l /dev/rdsk/...

... for each of the 5 drives in your raidz2.
(maybe best as an attachment)
Are you seeing labels with the error  'failed to unpack'?
What is the reported 'status' of your zpool?
(You have not provided a 'zpool status')
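(That is one run per drive, with the real device name filled in, along the lines
of

# zdb -l /dev/rdsk/c1t1d0s0

using whatever names your 'zpool status' shows - the c1t1d0s0 here is only an
example.)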
Thanks
Nigel Smith
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool import: all devices online but: insufficient replicas

2008-10-27 Thread Nigel Smith
Hi Kristof 
Please could you post back to this forum the output from

# zdb -l /dev/rdsk/...

... for each of the storage devices in your pool,
while it is in a working condition on Server1.
(Maybe best as an attachment)
Then do the same again with the pool on Server2.

What is the reported 'status' of your zpool on Server2?
(You have not provided a 'zpool status')
Thanks
Nigel Smith
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] My 500-gig ZFS is gone: insufficient replicas, corrupted data

2008-10-27 Thread Nigel Smith
Hi Miles
I think you make some very good points in your comments.
It would be nice to get some positive feedback on these from Sun.

And my thought also on (quickly) looking at that bug  ARC case was
does not this also need to be factored into the SATA framework.

I really miss not having 'smartctl' (fully) working with PATA and 
SATA drives on x86 Solaris.

I've done a quick search on PSARC 2007/660 and it was
closed approved fast-track 11/28/2007.
I did a quick search, but I could not find any code that had been
committed to 'onnv-gate' that references this case.
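(For anyone with a local clone of onnv-gate, a quick way to double-check is to
search the changeset comments, e.g.

$ hg log -k "2007/660"

since putbacks normally reference the PSARC case number in the commit message.)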
Regards
Nigel Smith
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

