Re: [zfs-discuss] Single disk parity

2009-07-09 Thread Richard Elling

Christian Auby wrote:

On Wed, 8 Jul 2009, Moore, Joe wrote:
That's true for the worst case, but zfs mitigates
that somewhat by 
batching i/o into a transaction group.  This means
that i/o is done every 
30 seconds (or 5 seconds, depending on the version
you're running), 
allowing multiple writes to be written together in
the disparate 
locations.

I'd think that writing the same data two or three times is a much larger 
performance hit anyway. Calculating 5% parity and writing it in addition to the 
stripe might be heaps faster. Might try to do some tests on this.
  


Before you get too happy, you should look at the current constraints.
The minimum disk block size is 512 bytes for most disks, but there has
been talk in the industry of cranking this up to 2 or 4 kBytes.  For small
files, your 5% becomes 100%, and you might as well be happy now and
set copies=2.  The largest ZFS block size is 128 kBytes, so perhaps you
could do something with 5% overhead there, but you couldn't correct
very many bits with only 5%. How many bits do you need to correct?
I don't know... that is the big elephant in the room shaped like a question
mark.  Maybe zcksummon data will help us figure out what color the
elephant might be.

If you were to implement something at the DMU layer, which is where
copies are, then without major structural changes to the blkptr, you are
restricted to 3 DVAs.  So the best you could do there is 50% overhead,
which would be a 200% overhead for small files.

If you were to implement at the SPA layer, then you might be able to
get back to a more consistently small overhead, but that would require
implementing a whole new vdev type, which means integration with
install, grub, and friends.  You would need to manage spatial diversity,
which might impact the allocation code in strange ways, but surely is
possible. The spatial diversity requirement means you basically can't
gain much by replacing a compressor with additional data redundancy,
though it might be an interesting proposal for the summer of code.

Or you could just do it in user land, like par2.
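As a rough illustration of that userland route (this assumes the par2cmdline
tool; file names are made up, just to show the shape of it):

  par2 create -r5 mydata.par2 mydata.bin   # generate ~5% recovery data
  par2 verify mydata.par2                  # later, check the file against it
  par2 repair mydata.par2                  # attempt recovery if verify fails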

Bottom line: until you understand the failure modes you're trying
to survive, you can't make significant progress except by accident.
We know that redundant copies allow us to correct all bits for very
little performance impact, but they cost space. Trying to win back that space
without sacrificing dependability will cost something -- most likely
performance.

NB, one nice thing about copies is that you can set it per-file system.
For my laptop, I don't set copies for the OS, but I do for my home
directory.  This is a case where I trade off dependability of read-only
data, which is available on CD or on the net, to gain a little bit of
space.  But I don't compromise on dependability for my data.
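In zfs terms that is simply something like this (the dataset name is just an
example):

  # zfs set copies=2 rpool/export/home
  # zfs get copies rpool/export/home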
-- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Migrating 10TB of data from NTFS is there a simple way?

2009-07-09 Thread Jim Klimov
You might also search for OpenSolaris NAS projects. Some that I've seen 
previously
involve nearly the same config you're building - a CF card or USB stick with 
the OS
and a number of HDDs in a zfs pool for the data only.

I am not certain which ones I've seen, but you can look for EON, and PulsarOS...

http://eonstorage.blogspot.com/2008_11_01_archive.html (features page)
http://eonstorage.blogspot.com/2009/05/eon-zfs-nas-0591-based-on-snv114.html

http://code.google.com/p/pulsaros/
http://pulsaros.digitalplayground.at/

Haven't yet tried them, though.

//Jim
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Migrating 10TB of data from NTFS is there a simple way?

2009-07-09 Thread Jim Klimov
> Trying to spare myself the expense as this is my home system so budget is 
> a constraint. 

> What I am trying to avoid is having multiple raidz's because every time I 
> have another one I lose a lot of extra space to parity. Much like in RAID 5.

There's a common perception which I tend to share now, that "consumer" drives 
have a somewhat higher rate of unreliability and failure. Some aspects relate to
the design priorities (i.e. balance price vs size vs duty cycle), or 
conspiracy-theory 
stuff (force consumers into buying more drives more often). 

Hand-made computers tend to increase that rate for any number of reasons 
(components, connections, thermal issues, power source issues). I've learned that 
the hard way while building many home computers and cheap campus servers at 
my uni, including 24-drive Linux filers with mdadm and hardware RAID cards :)

Another problem is that larger drives take a lot longer to rebuild (about 4 
hours to
write a single drive in your case with an otherwise idle system) or even 
resilver
with a filled-up array like yours. This is especially a problem in classic RAID 
setups, where the whole drive is considered failed if anything goes wrong. Quite 
often some hidden problem surfaces on another drive of the array during the rebuild, 
so that drive is considered dead too, and the chance of this grows with disk size. 
That's one of many valid reasons why "enterprise" drives are smaller. Hopefully ZFS 
contains such failures down to the few blocks whose checksums mismatch.

Anyway, I'd not be comfortable with large sets of big, less-reliable drives even with 
some redundancy. Hence my somewhat arbitrary recommendation of 4-drive raidz1 sets. 
The industry seems to agree that at most 7-9 drives are reasonable for a single 
RAID5/6 volume (a vdev in the case of ZFS), though.

Since you already have 2 clean 1Tb disks, you can buy just 2 more. In the end 
you'd have one 4*1Tb raidz1 and two 4*1.5Tb raidz1 vdevs in a pool, summing 
up to 3+(4.5*2) = 12Tb of usable space in a redundant set. For me personally,
that would be worth its salt.
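For illustration, building that layout would look roughly like this (device
names below are hypothetical; substitute your actual cXtYdZ devices):

  # zpool create tank raidz1 c1t0d0 c1t1d0 c1t2d0 c1t3d0     # 4 x 1Tb
  # zpool add tank raidz1 c2t0d0 c2t1d0 c2t2d0 c2t3d0        # 4 x 1.5Tb
  # zpool add tank raidz1 c2t4d0 c2t5d0 c2t6d0 c2t7d0        # 4 x 1.5Tb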

There may however be some discrepancy: the space on the first set
(3Tb) only amounts to 2*1Tb drives freeing up. That can introduce more
costly corrections into my calculations (i.e. a 5*1Tb disk set)...

Concerning the SD/CF card for booting, I have no experience. From what I've
seen, you can google for notes on card booting in the Eee PC and similar netbooks, 
and for comments on the making of livecd/liveusb-capable Solaris distros 
(see some at http://www.opensolaris.org/os/downloads/).

You'd probably need to make sure that the BIOS emulates the card as an IDE/SATA
hard disk device, and/or bundle the needed drivers into the Solaris miniroot.

> And last thx so very much for spending so much time and effort in 
> transferring 
> knowledge, I really do appreciate it. 

You're very welcome. I do hope this helps and you don't lose data in the 
process,
due to my possible mistakes or misconceptions, or otherwise ;)

//Jim
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs root, jumpstart and flash archives

2009-07-09 Thread Lori Alt

On 07/09/09 17:25, Mark Michael wrote:

Thanks for the info.  Hope that the pfinstall changes to support zfs root flash 
jumpstarts can be extended to support luupgrade -f at some point soon.

BTW, where can I find an example profile?  do I just substitute in the 


  install_type flash_install
  archive_location ...

for

   install_type initial_install

??
  

Here's a sample:

install_type flash_install
archive_location nfs schubert:/export/home/lalt/mirror.flar
partitioning explicit
pool rpool auto auto auto mirror c0t1d0s0 c0t0d0s0



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Single disk parity

2009-07-09 Thread Christian Auby
> On Wed, 8 Jul 2009, Moore, Joe wrote:
> That's true for the worst case, but zfs mitigates
> that somewhat by 
> batching i/o into a transaction group.  This means
> that i/o is done every 
> 30 seconds (or 5 seconds, depending on the version
> you're running), 
> allowing multiple writes to be written together in
> the disparate 
> locations.
> 

I'd think that writing the same data two or three times is a much larger 
performance hit anyway. Calculating 5% parity and writing it in addition to the 
stripe might be heaps faster. Might try to do some tests on this.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs root, jumpstart and flash archives

2009-07-09 Thread Mark Michael
Thanks for the info.  Hope that the pfinstall changes to support zfs root flash 
jumpstarts can be extended to support luupgrade -f at some point soon.

BTW, where can I find an example profile?  do I just substitute in the 

  install_type flash_install
  archive_location ...

for

   install_type initial_install

??
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] creating a zpool inside a zone with zvols from the global zone

2009-07-09 Thread Alastair Neil
I'm not sure if this is the correct list for this query; however, I am
trying to create a number of zpools inside a zone.  I am running snv_117 and
this is an ipkg branded zone. Here is the zone configuration:

a...@vs-idm:~$  zonecfg -z vsnfs-02 export
> create -b
> set zonepath=/rpool/zones/vsnfs-02
> set brand=ipkg
> set autoboot=true
> set ip-type=shared
> add net
> set address=xxx.xxx.xxx.xxx
> set physical=e1000g0
> end
> add device
> set match=/dev/zvol/dsk/rpool/[uw][0123]-test
> end
> add device
> set match=/dev/zvol/rdsk/rpool/[uw][0123]-test
> end
>

The device for the pool is a zvol created in the global zone and added to
the local zone using "add device" in zonecfg.

I get this error:

pfexec zpool create -m /VS/home/.u0 u0 /dev/zvol/dsk/rpool/u0-test
> cannot create 'u0': permission denied
>


I take it I am trying to do something that is not intended?

Thanks, Alastair
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool replace leaves pool degraded after resilvering

2009-07-09 Thread William Bauer
2009.06 is v111b, but you're running v111a.  I don't know, but perhaps the a->b 
transition addressed this issue, among others?
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Very slow ZFS write speed to raw zvol

2009-07-09 Thread Jim Klimov
After reading many, many threads on ZFS performance today (top of the list in the 
forum, and some chains of references), I applied a bit of tuning to the server.

In particular, I've set zfs_write_limit_override to 384MB so the cache is spooled 
to disks more frequently (if streaming lots of writes) and in smaller increments.

* echo zfs_write_limit_override/W0t402653184 | mdb -kw
set zfs:zfs_write_limit_override = 0x18000000
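To read the current value back, something like this should work (a sketch,
assuming the variable is the 64-bit unsigned integer it is in the OpenSolaris
source):

  # echo zfs_write_limit_override/E | mdb -k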

The system seems to be working more smoothly (vs. jerky), and "zpool iostat" 
values are not quite as jumpy (i.e. 320MBps to 360MBps for a certain test). 
The results also seem faster and more consistent.

With this tuning applied, I'm writing to a 40G zvol, 1M records (count=1048576) 
of:
4k (bs=4096): 17s (12s), 241MBps
8k (bs=8192): 29s (18s), 282MBps
16k (bs=16384): 54s (30s), 303MBps
32k (bs=32768): 113s (56s), 290MBps
64k (bs=65536): 269s (104s), 243MBps

And 10240 larger records of:
1 MB (bs=1048576): 33s (8s), 310MBps
2 MB (bs=2097152): 74s (23s), 276MBps

And 1024 yet larger records:
1 MB (bs=1048576): 4s (1s), 256MBps
4 MB (bs=4194304): 12s (5s), 341MBps
16MB (bs=16777216): 71s (18s), 230MBps
32MB (bs=33554432): 150s (36s), 218MBps

So the zvol picture is quite a bit better now (albeit not perfect - i.e. no values 
are near the 1GBps noted previously in "zpool iostat"), for both small and large 
blocks.

For the filesystem dataset the new values are very similar (down to tenths of a 
second on the smaller blocksizes!), but as the blocksize grows, filesystems start 
losing to the zvols. Overall the result seems lower than what I achieved before I 
tried tuning.

1M records (count=1048576) of:
4k (bs=4096): 17s (12s), 241MBps
8k (bs=8192): 29s (18s), 282MBps
16k (bs=16384): 67s (30s), 245MBps
32k (bs=32768): 144s (55s), 228MBps
64k (bs=65536): 275s (98s), 238MBps

And 10240 larger records go better:
1 MB (bs=1048576): 33s (9s), 310MBps
2 MB (bs=2097152): 70s (21s), 292MBps

And 1024 yet larger records:
1 MB (bs=1048576): 2.8s (0.8s), 366MBps
4 MB (bs=4194304): 12s (4s), 341MBps
16MB (bs=16777216): 55s (17s), 298MBps
32MB (bs=33554432): 140s (36s), 234MBps

Occasionally I did reruns; user time for the same setups can vary significantly
(like 65s vs 84s) while the system time stays pretty much the same.

"zpool iostat" shows larger values (like 320MBps typically) but I think that 
can be 
attributed to writing parity stripes on raidz vdevs.

//Jim

PS: for completeness, I'll try smaller blocks without tuning in a future post.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Changing GUID

2009-07-09 Thread Cyril Plisko
On Thu, Jul 9, 2009 at 8:42 PM, Norbert wrote:
> Does anyone have the code/script to change the GUID of a ZFS pool?

I wrote such a tool for a client around a year ago and that client agreed
to release the code.
However, the API I used has since been changed and is not available
anymore, so you cannot compile it on recent Nevada releases. I may
consider retrofitting it if I have enough time and motivation.

-- 
Regards,
Cyril
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Changing GUID

2009-07-09 Thread Norbert
Does anyone have the code/script to change the GUID of a ZFS pool?
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Question about user/group quotas

2009-07-09 Thread Greg Mason
Thanks for the link Richard,

I guess the next question is, how safe would it be to run snv_114 in
production? Running something that would be technically "unsupported"
makes a few folks here understandably nervous...

-Greg

On Thu, 2009-07-09 at 10:13 -0700, Richard Elling wrote:
> Greg Mason wrote:
> > I'm trying to find documentation on how to set and work with user and
> > group quotas on ZFS. I know it's quite new, but googling around I'm just
> > finding references to a ZFS quota and refquota, which are
> > filesystem-wide settings, not per user/group.
> >   
> 
> Cindy does an excellent job of keeping the ZFS Admin Guide up to date.
> http://opensolaris.org/os/community/zfs/docs/zfsadmin.pdf
> See the section titled "Setting User or Group Quotas on a ZFS File System"
>  -- richard
> > Also, after reviewing a few bugs, I'm a bit confused about which build
> > has user quota support. I recall that snv_111 has user quota support,
> > but not in rquotad. According to bug 6501037, ZFS user quota support is
> > in snv_114. 
> >
> > We're preparing to roll out OpenSolaris 2009.06 (snv_111b), and we're
> > also curious about being able to utilize ZFS user quotas, as we're
> > having problems with NFSv4 on our clients (SLES 10 SP2). We'd like to be
> > able to use NFSv3 for now (one large ZFS filesystem, with user quotas
> > set), until the flaws with our Linux NFS clients can be addressed.
> >
> >   
> 

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?

2009-07-09 Thread William Bauer
I don't swear.  The word it bleeped was not a bad word.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?

2009-07-09 Thread William Bauer
I have a much more generic question regarding this thread.  I have a Sun T5120 
(T2 quad core, 1.4GHz) with two 10K RPM SAS drives in a mirrored pool running 
Solaris 10 u7.  The disk performance seems horrible.  I have the same apps 
running on a Sun X2100M2 (dual core 1.8GHz AMD) also running Solaris 10u7 and 
an old, really poor performing SATA drive (also with ZFS), and its disk 
performance seems at least 5x better.

I'm not offering much detail here, but I had been attributing this to what I've 
always observed: Solaris on x86 performs far better than on SPARC for any app 
I've ever used.

I guess the real question would be is ZFS ready for production in Solaris 10, 
or should I flar this bugger up and rebuild with UFS?  This thread concerns me, 
and I really want to keep ZFS on this system for its many features.  Sorry if 
this is off-topic, but you guys got me wondering.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Single disk parity

2009-07-09 Thread Richard Elling

Haudy Kazemi wrote:


Adding additional data protection options are commendable.  On the 
other hand I feel there are important gaps in the existing feature 
set that are worthy of a higher priority, not the least of which is 
the automatic recovery of uberblock / transaction group problems 
(see Victor Latushkin's recovery technique which I linked to in a 
recent post), 


This does not seem to be a widespread problem.  We do see the
occasional complaint on this forum, but considering the substantial
number of ZFS implementations in existence today, the rate seems
to be quite low.  In other words, the impact does not seem to be high.
Perhaps someone at Sun could comment on the call rate for such
conditions?
I counter this.  The user impact is very high when the pool is 
completely inaccessible due to a minor glitch in the ZFS metadata, and 
the user is told to restore from backups, particularly if they've been 
considering snapshots to be their backups (I know they're not the same 
thing).  The incidence rate may be low, but the impact is still high, 
and anecdotally there have been enough reports on list to know it is a 
real non-zero event probability.  


Impact in my context is statistical.  If everyone was hitting this problem,
then it would have been automated long ago.  Sun does track such reports
and will know their rate.

Think earth-asteroid collisions...doesn't happen very often but is 
catastrophic when it does happen.  Graceful handling of low incidence 
high impact events plays a role in real world robustness and is 
important in widescale adoption of a filesystem.  It is about software 
robustness in the face of failure vs. brittleness.  (In another area, 
I and others found MythTV's dependence on MySQL to be a source of system 
brittleness.)  Google adopts robustness principles in its Google File 
System (GFS) by not trusting the hardware at all and then keeping a 
minimum of three copies of everything on three separate computers.


Right, so you also know that the reports of this problem are for non-mirrored
pools.  I agree with Google, mirrors work.



Consider the users/admin's dilemma of choosing between a filesystem 
that offers all the great features of ZFS but can be broken (and is 
documented to have broken) with a few miswritten bytes, or choosing a 
filesystem with no great features but is also generally robust to wide 
variety of minor metadata corrupt issues.  Complex filesystems need to 
take special measures that their complexity doesn't compromise their 
efforts at ensuring reliability.  ZFS's extra metadata copies provide 
this versus simply duplicating the file allocation table as is done in 
FAT16/32 filesystems (a basic filesystem).  The extra filesystem 
complexity also makes users more dependent upon built in recovery 
mechanisms and makes manual recovery more challenging. (This is an 
unavoidable result of more complicated filesystem design.)


I agree 100%.  But the question here is manual vs automated, not possible
vs impossible.  Even the venerable UFS fsck defers to manual if things are
really messed up.



More below.
followed closely by a zpool shrink or zpool remove command that lets 
you resize pools and disconnect devices without replacing them.  I 
saw postings or blog entries from about 6 months ago that this code 
was 'near' as part of solving a resilvering bug but have not seen 
anything else since.  I think many users would like to see improved 
resilience in the existing features and the addition of frequently 
long requested features before other new features are added.  
(Exceptions can readily be made for new features that are trivially 
easy to implement and/or are not competing for developer time with 
higher priority features.)


In the meantime, there is the copies flag option that you can use on 
single disks.  With immense drives, even losing 1/2 the capacity to 
copies isn't as traumatic for many people as it was in days gone 
by.  (E.g. consider a 500 gb hard drive with copies=2 versus a 128 
gb SSD).  Of course if you need all that space then it is a no-go.


Space, performance, dependability: you can pick any two.



Related threads that also had ideas on using spare CPU cycles for 
brute force recovery of single bit errors using the checksum:


There is no evidence that the type of unrecoverable read errors we
see are single bit errors.  And while it is possible for an error handling
code to correct single bit flips, multiple bit flips would remain as a
large problem space.  There are error codes which can correct multiple
flips, but they quickly become expensive.  This is one reason why nobody
does RAID-2.
Expensive in CPU cycles or engineering resources or hardware or 
dollars?  If the argument is CPU cycles, then that is the same case 
made against software RAID as a whole and an argument increasingly 
broken by modern high performance CPUs.  If the argument is 
engineering resources, consider the complexity of ZFS itself.  If the 
a

Re: [zfs-discuss] Question about user/group quotas

2009-07-09 Thread Richard Elling

Greg Mason wrote:

I'm trying to find documentation on how to set and work with user and
group quotas on ZFS. I know it's quite new, but googling around I'm just
finding references to a ZFS quota and refquota, which are
filesystem-wide settings, not per user/group.
  


Cindy does an excellent job of keeping the ZFS Admin Guide up to date.
http://opensolaris.org/os/community/zfs/docs/zfsadmin.pdf
See the section titled "Setting User or Group Quotas on a ZFS File System"
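For quick reference, the commands from that section go roughly like this (user,
group and dataset names below are placeholders):

  # zfs set userquota@jsmith=5G tank/home
  # zfs set groupquota@staff=50G tank/home
  # zfs userspace tank/home     # show per-user usage and quotas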
-- richard

Also, after reviewing a few bugs, I'm a bit confused about which build
has user quota support. I recall that snv_111 has user quota support,
but not in rquotad. According to bug 6501037, ZFS user quota support is
in snv_114. 


We're preparing to roll out OpenSolaris 2009.06 (snv_111b), and we're
also curious about being able to utilize ZFS user quotas, as we're
having problems with NFSv4 on our clients (SLES 10 SP2). We'd like to be
able to use NFSv3 for now (one large ZFS filesystem, with user quotas
set), until the flaws with our Linux NFS clients can be addressed.

  

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Question about user/group quotas

2009-07-09 Thread Darren J Moffat

Greg Mason wrote:

I'm trying to find documentation on how to set and work with user and
group quotas on ZFS. I know it's quite new, but googling around I'm just
finding references to a ZFS quota and refquota, which are
filesystem-wide settings, not per user/group.

Also, after reviewing a few bugs, I'm a bit confused about which build
has user quota support. I recall that snv_111 has user quota support,
but not in rquotad. According to bug 6501037, ZFS user quota support is
in snv_114. 


ZFS user quota support and the corresponding rquotad support did not 
integrate until build 114, so you would need to first install 2009.06, 
then switch to the http://pkg.opensolaris.org/dev repository and 'pkg 
image-update'.
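That switch is roughly the following (a sketch; the publisher name is assumed
to be the default opensolaris.org one):

  # pkg set-publisher -O http://pkg.opensolaris.org/dev opensolaris.org
  # pkg image-update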


--
Darren J Moffat
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs root, jumpstart and flash archives

2009-07-09 Thread Lori Alt
Flash archive on zfs means archiving an entire root pool (minus any 
explicitly excluded datasets), not an individual BE.  These types of 
flash archives can only be installed using Jumpstart and are intended to 
install an entire system, not an individual BE.


Flash archives of a single BE could perhaps be implemented in the future.

Lori

On 07/09/09 09:56, Mark Michael wrote:

I've been hoping to get my hands on patches that permit Sol10U7 to do a 
luupgrade -f of a ZFS root-based ABE since Solaris 10 10/08.

Unfortunately, after applying patchids 119534-15 and 124630-26 to both the PBE and the 
miniroot of the OS image, I'm still getting the same "ERROR: Field 2 - Invalid disk 
name (insert_abe_name_here)".

The flarcreate command I used was simply 


  #  flarcreate -n root_var_no_snap /export/fssnap/flars/root_var

which created a flar file that was about 4 to 5 times the size of a UFS-based 
flar file.

I then used the command

  # luupgrade -f -n be_d70 -s /export/fssnap/os_image \
  > -a /export/fssnap/flars/root_var

which then failed with the pfinstall diagnostic given above.

What am I still doing wrong?

ttfn
mm
mark.o.mich...@boeing.com
mark.mich...@es.bss.boeing.com
  


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Question about user/group quotas

2009-07-09 Thread Greg Mason
I'm trying to find documentation on how to set and work with user and
group quotas on ZFS. I know it's quite new, but googling around I'm just
finding references to a ZFS quota and refquota, which are
filesystem-wide settings, not per user/group.

Also, after reviewing a few bugs, I'm a bit confused about which build
has user quota support. I recall that snv_111 has user quota support,
but not in rquotad. According to bug 6501037, ZFS user quota support is
in snv_114. 

We're preparing to roll out OpenSolaris 2009.06 (snv_111b), and we're
also curious about being able to utilize ZFS user quotas, as we're
having problems with NFSv4 on our clients (SLES 10 SP2). We'd like to be
able to use NFSv3 for now (one large ZFS filesystem, with user quotas
set), until the flaws with our Linux NFS clients can be addressed.

-- 
Greg Mason
System Administrator
Michigan State University
High Performance Computing Center

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs root, jumpstart and flash archives

2009-07-09 Thread Mark Michael
I've been hoping to get my hands on patches that permit Sol10U7 to do a 
luupgrade -f of a ZFS root-based ABE since Solaris 10 10/08.

Unfortunately, after applying patchids 119534-15 and 124630-26 to both the PBE 
and the miniroot of the OS image, I'm still getting the same "ERROR: Field 2 - 
Invalid disk name (insert_abe_name_here)".

The flarcreate command I used was simply 

  #  flarcreate -n root_var_no_snap /export/fssnap/flars/root_var

which created a flar file that was about 4 to 5 times the size of a UFS-based 
flar file.

I then used the command

  # luupgrade -f -n be_d70 -s /export/fssnap/os_image \
  > -a /export/fssnap/flars/root_var

which then failed with the pfinstall diagnostic given above.

What am I still doing wrong?

ttfn
mm
mark.o.mich...@boeing.com
mark.mich...@es.bss.boeing.com
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Migrating 10TB of data from NTFS is there a simple way?

2009-07-09 Thread Xen Dar
> > I installed opensolaris and setup rpool as my base
> install on a single 1TB drive
> 
> If I understand correctly, you have rpool and the
> data pool configured all as one 
> pool?
Correct 

> That's not probably what you'd really want. For one
> part, the bootable root pool
> should all be available to GRUB from a single
> hardware device and this precludes
> any striping or raidz configurations for the root
> pool (only single drives and 
> mirrors are supported).
Makes sense
> You should rather make a separate root pool (depends
> on your installation size,
> RAM -> swap, number of OS versions to roll back); I'd
> suffice with anything from 
> 8 to 20Gb. And the rest of the disk (as another
> slice) becomes the data pool which
I would like to use a 16GB SD card for this - if there is a post or a resource 
on "how to" that you know of, please point me to it.
> can later be expanded by adding stripes. Obviously,
> data already on the disk 
> won't magically become striped to all drives unless
> you rewrite it.
> 
> > a single 1TB drive
> 
> Minor detail: I thought you were moving 1.5TB disks?
> Or did you find a drive with
> adequately few data (1 TB used)?
I have 2 x 1TB drives that are clean and 8 x 1.5TB drives with all my data on them.
> > transfering data accross till the drive was empty
> 
> I thought NTFS driver for Solaris is read-only?
Nope, I copied (not moved) all the data - 800GB so far in three and a half hours - 
successfully to my rpool.

> Not a good transactional approach. Delete original
> data only after all copying has 
> completed (and perhaps cross-checked) and the disk
> can actually be reused in the
> ZFS pool.
> 
> For example, if you were to remake the pool (as
> suggested above for rpool and 
> below for raidz data pool) - where would you re-get
> the original data for copying 
> over again?
> 
> > I havent worked out if I can transform my zpool int
> a zraid after I have 
> > copied all my data.
> 
> My guess would be - no, you can't (not directly at
> least). I think you can mirror the
> striped pool's component drives on the fly, by buying
> new drives one at a time - 
> which requires buying these drives. Or if you buy and
> attach all 8-9 drives at once,
I'm trying to spare myself the expense, as this is my home system, so budget is a 
constraint. 
> you can build another pool with raidz layout and
> migrate all data to it. Your old 
> drives can then be attached to this pool as another
> raidz vdev stripe (or even 
> mirror, but that's probably not needed for your
> usecase). These scenarios are
> not unlike raid50 or raid51, respectively.
> 
> In case of striping, you can build and expand your
> pool by vdev's of different 
> layout and size. As said before, currently there's a
> problem that you can't shrink
> the pool to remove devices (other than break mirrors
> into single drives).
> 
> Perhaps you can get away by buying now only the
> "parity" drives for your future 
> pool layout (which depends on the number of
> motherboard/controller connectors,
> and power source capacity, and your computer case
> size, etc.) and following the 
> ideas for "best-case" scenario from my post.
The motherboard has 7 SATA connectors; in addition I have an Intel SATA RAID 
controller with 6 connectors which I haven't installed yet, and I am using a 
dual-PSU Coolermaster case which supports 16 drives.
 
> 
> Then you'd start the pool by making a raidz1 device
> of 3-5 drives total (new empty 
> ones, possibly including the "missing" fake parity
> device), and then making and 
> attaching to the pool more new similar raidz vdev's
> as you free up NTFS disks.
> 
> I did some calculations on this last evening.
> 
> For example, if your data fits on 8 "data" drives,
> you can make 1*8-Ddrive raidz1 
> set with 9 drives (8+1), 2*4-Ddrive sets with 10
> drives (8+2), 3*3-Ddrive sets with 
> 12 drives (9+3). 
> 
> I'd buy 4 new drives and stick with the latter
> 12-drive pool scenario - 
> 1) build a complete 4-drive raidz1 set (3-Ddrive +
> 1*Pdrive), 
> 2) move over 3 drives worth of data,
> 3) build and attach a fake 4-drive raidz1 set
> (3-Ddrive + 1 missing Pdrive),
> 4) move over 3 drives worth of data,
> 5) build and attach a fake 4-drive raidz1 set
> (3-Ddrive + 1 missing Pdrive),
> 6) move over 2 drives worth of data,
> 7) complete the parities for the missing Pdrives of
> the two faked sets.
> 
> This does not in any way involve the capacity of your
> bootroot drives (which I think
> were expected to be a CF card, no?). So you already
> have at least one such drive ;)
> Even if your current drive is partially consumed by
> the root pool, I think you can 
> sacrifice some 20Gb on each drive in one 4-disk
> raidz1 vdev. You can mirror the 
> root pool with one of these drives, and make a
> mirrored swap pool on the other 
> couple.
Ok, I am going to have to read through this slowly and fully understand the fake 
raid scenario. What I am trying to avoid is having multiple raidz's, because 
every time I have another one I lose a lot of extra space to parity.

Re: [zfs-discuss] Very slow ZFS write speed to raw zvol

2009-07-09 Thread Ross Walker

On Jul 9, 2009, at 4:22 AM, Jim Klimov  wrote:

To tell the truth, I expected zvols to be faster than filesystem  
datasets. They seem
to have less overhead without inodes, posix, acls and so on. So I'm  
puzzled by test

results.

I'm now considering the dd i/o block size, and it means a lot  
indeed, especially if

compared to zvol results with small blocks like 64k.

I ran a number of tests with a zvol recreated by commands before  
each run (this
may however cause varying fragmentation impacting results of  
different runs):


# zfs destroy -r pond/test; zfs create -V 30G pond/test; zfs set  
compression=off pond/test; sync; dd if=/dev/zero of=/dev/zvol/rdsk/ 
pond/test count=1000 bs=512; sync


and tests going like

# time dd if=/dev/zero of=/dev/zvol/rdsk/pond/test count=1024 bs=1048576
1024+0 records in
1024+0 records out

real0m42.442s
user0m0.006s
sys 0m4.292s

The test progresses were quite jumpy (with "zpool iostat pond 1"  
values varying

from 30 to 70 MBps, reads coming in sometimes).

So I'd stick to overall result - the rounded wallclock time it takes  
to write 1024
records of varying size and resulting average end-user MBps. I also  
write "sys"
time since that's what is consumed by the kernel and the disk  
subsystem, after all.
I don't write zpool iostat speeds, since they vary too much and I  
don't bother
with a spreadsheet right now. But the reported values stay about  
halfway between
"wallclock MBps" and "sys MBps" calculations, on the perceived  
average, peaking

at about 350MBps for large block sizes (>4MB).

1 MB (bs=1048576): 42s (4s), 24MBps
4 MB (bs=4194304): 42s (15s), 96MBps
16MB (bs=16777216): 129s-148s (62-64s), 127-110MBps
32MB (bs=33554432, 40Gb zvol): 303s (127s), 108MBps

Similar results for writing a file to a filesystem; "zpool iostat"  
values again
jumped anywhere between single MBps to GBps. Simple cleanups used  
like:


# rm /pool/test30g; sync; time dd if=/dev/zero of=/pool/test30g
count=1024 bs=33554432

Values remain somewhat consistent (in the same league, at least):
1 MB (bs=1048576, 10240 blocks): 20-21s (7-8s), 512-487MBps

1 MB (bs=1048576): 2.3s (0.6s), 445MBps
4 MB (bs=4194304): 8s (3s), 512MBps
16MB (bs=16777216): 37s (15s), 442MBps
32MB (bs=33554432): 74-103s (32-42s), 442-318MBps

64Kb (bs=65536, 545664 blocks): 94s (47s), 362MBps

All in all, to make more precise results these tests should be made  
in greater

numbers and averaged. But here we got some figures to think about...

On a side note, now I'll pay more attention to tuning suggestions  
which involve
multi-megabyte buffers for network sockets, etc. They can actually  
cause an

impact to performance many times over!

On another note,

For some reason I occasionally got results like this:
write: File too large
1+0 records in
1+0 records out

I think the zvol was not considered created by that time. In about  
10-15 sec I was
able to commence the test run. Perhaps it helped that I  
"initialized" the zvol by a

small write after creation, then:
# dd if=/dev/zero of=/dev/zvol/rdsk/pond/test count=1000 bs=512
Strange...


When running throughput tests the block sizes to be concerned about  
are: 4k, 8k, 16k and 64k. These are the sizes that most file systems  
and databases use.


If you get 4k to perform well, chances are the others will fall into  
line.
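A quick sweep over those sizes could look like this (a sketch following the dd
pattern used earlier in this thread; adjust the target and count to taste):

  for bs in 4k 8k 16k 64k; do
      echo "bs=$bs"
      time dd if=/dev/zero of=/dev/zvol/rdsk/pond/test bs=$bs count=100000
  done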


-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs root, jumpstart and flash archives

2009-07-09 Thread Fredrich Maney
Thanks everyone for the patch IDs.

On Wed, Jul 8, 2009 at 4:50 PM, Enda O'Connor wrote:
> Hi
> for sparc
> 119534-15
> 124630-26
>
>
> for x86
> 119535-15
> 124631-27
>
> higher rev's of these will also suffice.
>
> Note these need to be applied to the miniroot of the jumpstart image so that
> it can then install zfs flash archive.
>  please read the README notes in these for more specific instructions,
> including instructions on miniroot patching.
>
> Enda
>
> Fredrich Maney wrote:
>>
>> Any idea what the Patch ID was?
>>
>> fpsm
>>
>> On Wed, Jul 8, 2009 at 3:43 PM, Bob
>> Friesenhahn wrote:
>>>
>>> On Wed, 8 Jul 2009, Jerry K wrote:
>>>
 It has been a while since this has been discussed, and I am hoping that
 you can provide an update, or time estimate.  As we are several months
 into
 Update 7, is there any chance of an Update 7 patch, or are we still
 waiting
 for Update 8.
>>>
>>> I saw that a Solaris 10 patch for supporting Flash archives on ZFS came
>>> out
>>> about a week ago.
>>>
>>> Bob
>>> --
>>> Bob Friesenhahn
>>> bfrie...@simple.dallas.tx.us,
>>> http://www.simplesystems.org/users/bfriesen/
>>> GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
>>> ___
>>> zfs-discuss mailing list
>>> zfs-discuss@opensolaris.org
>>> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>>>
>> ___
>> zfs-discuss mailing list
>> zfs-discuss@opensolaris.org
>> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>
>
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Issues with ZFS and SVM?

2009-07-09 Thread Peter Eriksson
I wonder exactly what's going on. Perhaps it is the cache flushes that are 
causing the SCSI errors 
when trying to use the SSD (Intel X25-E and X25-M) disks? Btw, I'm seeing the 
same behaviour on 
both an X4500 (SATA/Marvell controller) and the X4240 (SAS/LSI controller). 
Well, almost. On the 
X4500 I didn't see the errors printed on the console, but things behaved 
strangely - and I did see 
the same speedup.

If SVM silently disables cache flushes then perhaps there should be a HUGE 
warning printed somewhere 
(ZFS FAQ? Solaris documentation? In zpool when creating/adding devices?) 
about using ZFS with SVM? 

I wonder what the potential danger might be _if_ SVM disables cache flushes for 
the SLOG... 
Sure, that might mean a missed update on the filesystem, but since the data 
disks in the pool 
are raw disk devices, the ZFS filesystem should be stable (sans any possibly 
missed updates).
I think I can live with that. What I don't want is a corrupt 16TB zpool in case 
of a power outage...

Message was edited by: pen
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Migrating 10TB of data from NTFS is there a simple way?

2009-07-09 Thread Jim Klimov
One more note,

> For example, if you were to remake the pool (as suggested above for rpool and
> below for raidz data pool) - where would you re-get the original data for 
> copying
> over again?

Of course, if you take on with the idea of buying 4 drives and building a 
raidz1 vdev
right away, and if you actually moved (deleted) the data from the NTFS disk, you
should start by creating this new pool with a complete raidz1 vdev.

Then you transfer (copy then delete) data to it from your current ZFS pool and 
only
then you remake/migrate the root pool if needed. 

Perhaps it would make sense to start with a faked raidz1 array (along with a 
new 
smaller root pool on its drives) made of just 3 more 1Tb disks, so you would 
just 
recycle and add your current zfs drive as a parity disk to this pool after all 
is 
complete.

As you see, there's lots of options depending on budget, creativity and other 
factors. It is possible that in the course of your quest you'll try several of 
them.

Starting out with a transactional approach (i.e. not deleting the originals 
until
necessary) pays off in such cases.

//Jim
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Migrating 10TB of data from NTFS is there a simple way?

2009-07-09 Thread Jim Klimov
> I installed opensolaris and setup rpool as my base install on a single 1TB 
> drive

If I understand correctly, you have rpool and the data pool configured all as 
one 
pool?

That's probably not what you'd really want. For one part, the bootable root pool
should all be available to GRUB from a single hardware device and this precludes
any striping or raidz configurations for the root pool (only single drives and 
mirrors are supported).

You should rather make a separate root pool (depends on your installation size,
RAM -> swap, number of OS versions to roll back); I'd say anything from 
8 to 20Gb would suffice. And the rest of the disk (as another slice) becomes the data pool 
which
can later be expanded by adding stripes. Obviously, data already on the disk 
won't magically become striped to all drives unless you rewrite it.

> a single 1TB drive

Minor detail: I thought you were moving 1.5TB disks? Or did you find a drive with
little enough data on it (1 TB used)?

> transfering data accross till the drive was empty

I thought NTFS driver for Solaris is read-only?

Not a good transactional approach. Delete original data only after all copying 
has 
completed (and perhaps cross-checked) and the disk can actually be reused in the
ZFS pool.

For example, if you were to remake the pool (as suggested above for rpool and 
below for raidz data pool) - where would you re-get the original data for 
copying 
over again?

> I havent worked out if I can transform my zpool int a zraid after I have 
> copied all my data.

My guess would be - no, you can't (not directly at least). I think you can 
mirror the
striped pool's component drives on the fly, by buying new drives one at a time 
- 
which requires buying these drives. Or if you buy and attach all 8-9 drives at 
once, 
you can build another pool with raidz layout and migrate all data to it. Your 
old 
drives can then be attached to this pool as another raidz vdev stripe (or even 
mirror, but that's probably not needed for your usecase). These scenarios are
not unlike raid50 or raid51, respectively.

In case of striping, you can build and expand your pool by vdev's of different 
layout and size. As said before, currently there's a problem that you can't 
shrink
the pool to remove devices (other than break mirrors into single drives).

Perhaps you can get away by buying now only the "parity" drives for your future 
pool layout (which depends on the number of motherboard/controller connectors,
and power source capacity, and your computer case size, etc.) and following the 
ideas for "best-case" scenario from my post. 

Then you'd start the pool by making a raidz1 device of 3-5 drives total (new 
empty 
ones, possibly including the "missing" fake parity device), and then making and 
attaching to the pool more new similar raidz vdev's as you free up NTFS disks.
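One trick people use for the "missing" fake parity device is a sparse file that
gets offlined right away - a sketch with hypothetical names and sizes; note the
vdev runs degraded until the real disk arrives:

  # mkfile -n 1500g /var/tmp/fakedisk     # sparse file, no space actually allocated
  # zpool create tank raidz1 c1t0d0 c1t1d0 c1t2d0 /var/tmp/fakedisk
  # zpool offline tank /var/tmp/fakedisk  # pool now shows DEGRADED but is usable
  ... later, when a real drive is freed up:
  # zpool replace tank /var/tmp/fakedisk c1t3d0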

I did some calculations on this last evening.

For example, if your data fits on 8 "data" drives, you can make 1*8-Ddrive 
raidz1 
set with 9 drives (8+1), 2*4-Ddrive sets with 10 drives (8+2), 3*3-Ddrive sets 
with 
12 drives (9+3). 

I'd buy 4 new drives and stick with the latter 12-drive pool scenario - 
1) build a complete 4-drive raidz1 set (3-Ddrive + 1*Pdrive), 
2) move over 3 drives worth of data,
3) build and attach a fake 4-drive raidz1 set (3-Ddrive + 1 missing Pdrive),
4) move over 3 drives worth of data,
5) build and attach a fake 4-drive raidz1 set (3-Ddrive + 1 missing Pdrive),
6) move over 2 drives worth of data,
7) complete the parities for the missing Pdrives of the two faked sets.

This does not in any way involve the capacity of your bootroot drives (which I 
think
were expected to be a CF card, no?). So you already have at least one such 
drive ;)
Even if your current drive is partially consumed by the root pool, I think you 
can 
sacrifice some 20Gb on each drive in one 4-disk raidz1 vdev. You can mirror the 
root pool with one of these drives, and make a mirrored swap pool on the other 
couple.
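Attaching the root-pool mirror would go along these lines (a sketch; the slice
names are hypothetical, and the second disk needs an SMI label with a slice
covering the pool area):

  # zpool attach rpool c0t0d0s0 c1t0d0s0
  # installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c1t0d0s0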

//Jim
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Booting from detached mirror disk

2009-07-09 Thread Jim Klimov
You might also want to force ZFS into accepting a faulty root pool:

# zpool set failmode=continue rpool
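You can check what it is currently set to with:

# zpool get failmode rpool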

//Jim
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs snapshoot of rpool/* to usb removable drives?

2009-07-09 Thread Jim Klimov
You can also select which snapshots you'd like to copy - and egrep away what you
don't need.

Here's what I did to back up some servers to a filer (as compressed ZFS snapshots 
stored into files for further simple deployment on multiple servers, as well as 
offsite rsyncing of the said files). The example below is a framework from our 
scratchpad docs; modify it to a specific server's environment.

Apparently, such sending and receiving examples (see below) can be piped 
together without use of files (and gzip, ssh, whatever) within a local system.

# ZFS snapshot dumps

# prepare
TAGPRV='20090427-01'
TAGNEW='20090430-01-running'
zfs snapshot -r pool/zones@"$TAGNEW"

# incremental dump over NFS (needs TAGNEW/TAGPRV set)
cd /net/back-a/export/DUMP/manual/`hostname` && \
for ZSn in `zfs list -t snapshot | grep "$TAGNEW" | awk '{ print $1 }'`; do
    ZSp=`echo $ZSn | sed "s/$TAGNEW/$TAGPRV/"`
    Fi="`hostname`%`echo $ZSn | sed 's/\//_/g'`.incr.zfsshot.gz"
    echo "=== `date`"; echo "=   prev: $ZSp"; echo "=   new: $ZSn"; echo "=   new: incr-file: $Fi"
    /bin/time zfs send -i "$ZSp" "$ZSn" | /bin/time pigz -c - > "$Fi"
    echo "   res = [$?]"
done

# incremental dump over ssh (needs TAGNEW/TAGPRV set; paths hardcoded at the end)
for ZSn in `zfs list -t snapshot | grep "$TAGNEW" | awk '{ print $1 }'`; do
    ZSp=`echo $ZSn | sed "s/$TAGNEW/$TAGPRV/"`
    Fi="`hostname`%`echo $ZSn | sed 's/\//_/g'`.incr.zfsshot.gz"
    echo "=== `date`"; echo "=   prev: $ZSp"; echo "=   new: $ZSn"; echo "=   new: incr-file: $Fi"
    /bin/time zfs send -i "$ZSp" "$ZSn" | /bin/time pigz -c - | \
        ssh back-a "cat > /export/DUMP/manual/`hostname`/$Fi"
    echo "   res = [$?]"
done

All in all, these lines send the incremental snapshots between $TAGPRV and 
$TAGNEW to per-server directories as per-snapshot files. They are quickly 
compressed with pigz (parallel gzip) before writing.

First of all you'd of course need an initial dump (a full dump of any snapshot):

# Initial dump of everything except swap volumes
zfs list -H -t snapshot | egrep -vi 'swap|rpool/dump' | grep "@$TAGPRV" | \
    awk '{ print $1 }' | while read Z; do
    F="`hostname`%`echo $Z | sed 's/\//_/g'`.zfsshot"
    echo "`date`: $Z > $F.gz"
    time zfs send "$Z" | pigz -9 > $F.gz
done

Now, if your snapshots were named in an incrementing manner (like these 
timestamped examples above), you are going to have a directory with files 
named like this (it's assumed that incremented snapshots all make up a valid 
chain):

servername%p...@20090214-01.zfsshot.gz
servername%pool_zo...@20090214-01.zfsshot.gz
servername%pool_zo...@20090405-03.incr.zfsshot.gz
servername%pool_zo...@20090427-01.incr.zfsshot.gz
servername%pool_zones_gene...@20090214-01.zfsshot.gz
servername%pool_zones_gene...@20090405-03.incr.zfsshot.gz
servername%pool_zones_gene...@20090427-01.incr.zfsshot.gz
servername%pool_zones_general_...@20090214-01.zfsshot.gz
servername%pool_zones_general_...@20090405-03.incr.zfsshot.gz
servername%pool_zones_general_...@20090427-01.incr.zfsshot.gz

The last one is a large snapshot of the zone (ns4) while the first ones are 
small 
datasets which simply form nodes in the hierarchical tree. There's lots of 
these 
usually :)

You can simply import these files into a zfs pool by a script like:

# for F in *.zfsshot.gz; do echo "=== $F"; gzcat "$F" | time zfs recv -nFvd pool; done

It's probably better to run "zfs recv -nFvd" first (no-write verbose mode) to be 
certain 
about your write targets and about overwriting stuff (i.e. "zfs recv -F" would 
destroy any newer snapshots, if any - so you can first check which ones and 
possibly clone/rename them first).

// HTH, Jim Klimov
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Very slow ZFS write speed to raw zvol

2009-07-09 Thread Jim Klimov
To tell the truth, I expected zvols to be faster than filesystem datasets. They 
seem
to have less overhead without inodes, posix, acls and so on. So I'm puzzled by 
test 
results.

I'm now considering the dd i/o block size, and it means a lot indeed, 
especially if
compared to zvol results with small blocks like 64k. 

I ran a number of tests with a zvol recreated by commands before each run (this 
may however cause varying fragmentation impacting results of different runs):

# zfs destroy -r pond/test; zfs create -V 30G pond/test; zfs set 
compression=off pond/test; sync; dd if=/dev/zero of=/dev/zvol/rdsk/pond/test 
count=1000 bs=512; sync

and tests going like

# time dd if=/dev/zero of=/dev/zvol/rdsk/pond/test count=1024 bs=1048576
1024+0 records in
1024+0 records out

real0m42.442s
user0m0.006s
sys 0m4.292s

The test progresses were quite jumpy (with "zpool iostat pond 1" values varying
from 30 to 70 MBps, reads coming in sometimes). 

So I'd stick to the overall result - the rounded wallclock time it takes to write 
1024 
records of varying size and the resulting average end-user MBps. I also write "sys" 
time since that's what is consumed by the kernel and the disk subsystem, after 
all.
I don't write zpool iostat speeds, since they vary too much and I don't bother
with a spreadsheet right now. But the reported values stay about halfway 
between 
"wallclock MBps" and "sys MBps" calculations, on the perceived average, peaking 
at about 350MBps for large block sizes (>4MB).

1 MB (bs=1048576): 42s (4s), 24MBps
4 MB (bs=4194304): 42s (15s), 96MBps
16MB (bs=16777216): 129s-148s (62-64s), 127-110MBps
32MB (bs=33554432, 40Gb zvol): 303s (127s), 108MBps

Similar results for writing a file to a filesystem; "zpool iostat" values again 
jumped anywhere between single MBps to GBps. Simple cleanups used like:

# rm /pool/test30g; sync; time dd if=/dev/zero of=/pool/test30g 
count=1024 bs=33554432

Values remain somewhat consistent (in the same league, at least):
1 MB (bs=1048576, 10240 blocks): 20-21s (7-8s), 512-487MBps

1 MB (bs=1048576): 2.3s (0.6s), 445MBps
4 MB (bs=4194304): 8s (3s), 512MBps
16MB (bs=16777216): 37s (15s), 442MBps
32MB (bs=33554432): 74-103s (32-42s), 442-318MBps

64Kb (bs=65536, 545664 blocks): 94s (47s), 362MBps

All in all, to make more precise results these tests should be made in greater 
numbers and averaged. But here we got some figures to think about...

On a side note, now I'll pay more attention to tuning suggestions which involve
multi-megabyte buffers for network sockets, etc. They can actually cause an 
impact to performance many times over!

On another note,

For some reason I occasionally got results like this:
write: File too large
1+0 records in
1+0 records out

I think the zvol was not considered created by that time. In about 10-15 sec I 
was
able to commence the test run. Perhaps it helped that I "initialized" the zvol 
by a
small write after creation, then: 
# dd if=/dev/zero of=/dev/zvol/rdsk/pond/test count=1000 bs=512
Strange...
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool replace leaves pool degraded after resilvering

2009-07-09 Thread Maurilio Longo
I forgot to mention this is a 

SunOS biscotto 5.11 snv_111a i86pc i386 i86pc

version.

Maurilio.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] zpool replace leaves pool degraded after resilvering

2009-07-09 Thread Maurilio Longo
Hi,

I have a PC where a pool suffered a disk failure. I replaced the failed disk 
and the pool resilvered but, after resilvering, it was in this state:

mauri...@biscotto:~# zpool status iscsi
  pool: iscsi
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: resilver completed after 12h33m with 0 errors on Thu Jul  9 00:07:12 
2009
config:

NAME STATE READ WRITE CKSUM
iscsiDEGRADED 0 0 0
  mirror ONLINE   0 0 0
c2t0d0   ONLINE   0 0 0
c2t5d0   ONLINE   0 0 0
  mirror ONLINE   0 0 0
c2t6d0   ONLINE   0 0 0
c2t8d0   ONLINE   0 0 0
  mirror ONLINE   0 0 0
c11t0d0  ONLINE   0 0 0
c11t1d0  ONLINE   0 0 0
  mirror DEGRADED 0 0 0
c11t2d0  ONLINE   0 0 0
c11t3d0  DEGRADED 0 0 23,0M  too many errors
cache
  c1t4d0 ONLINE   0 0 0

errors: No known data errors

It says it resilvered OK and that there are no known data errors, but the pool is 
still marked as degraded.

I did a zpool clear and now it says it is ok

mauri...@biscotto:~# zpool status
  pool: iscsi
 state: ONLINE
 scrub: resilver completed after 12h33m with 0 errors on Thu Jul  9 00:07:12 
2009
config:

NAME STATE READ WRITE CKSUM
iscsiONLINE   0 0 0
  mirror ONLINE   0 0 0
c2t0d0   ONLINE   0 0 0
c2t5d0   ONLINE   0 0 0
  mirror ONLINE   0 0 0
c2t6d0   ONLINE   0 0 0
c2t8d0   ONLINE   0 0 0
  mirror ONLINE   0 0 0
c11t0d0  ONLINE   0 0 0
c11t1d0  ONLINE   0 0 0
  mirror ONLINE   0 0 0
c11t2d0  ONLINE   0 0 0
c11t3d0  ONLINE   0 0 0  326G resilvered
cache
  c1t4d0 ONLINE   0 0 0

errors: No known data errors

Look at c11t3d0, which now reads 326G resilvered; my question is: is the pool 
ok? Why did I have to issue a zpool clear if the resilvering process completed 
without problems?

Best regards.

Maurilio.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Very slow ZFS write speed to raw zvol

2009-07-09 Thread Jim Klimov
Hmm, scratch that. Maybe.

I didn't get at first that your writes to a filesystem dataset work quickly. 
Perhaps the filesystem is (better) cached indeed, i.e. *maybe* zvol writes are 
synchronous while zfs filesystem writes may be cached and thus async? Try playing 
around with the relevant dataset attributes...

I'm running a test on my system (a snv_114 Thumper, 16Gb RAM, used for other 
purposes as well); the CPU is mostly idle now (2.5-3.2% kernel time, that's 
about 
it). Seems I have results not unlike yours. Not cool, because I wanted to play 
with
COMSTAR iSCSI - and I'm not sure it will perform well ;)

I'm dd'ing 30Gb to an uncompressed test zvol with same 64kb block sizes (maybe 
they are too small?), and zpool iostat goes like this - a hundred IOs at 7Mbps 
for a
minute, then a burst of 100-170Mbps and 20-25K IOps for a second:

pond5.79T  4.41T  0106  0  7.09M
pond5.79T  4.41T  0  1.93K  0  20.7M
pond5.79T  4.41T  0  13.3K  0   106M
pond5.79T  4.41T  0116  0  7.76M
pond5.79T  4.41T  0108  0  7.23M
pond5.79T  4.41T  0107  0  7.16M
pond5.79T  4.41T  0107  0  7.16M

or

pond5.79T  4.41T  0117  0  7.83M
pond5.79T  4.41T  0  5.61K  0  49.7M
pond5.79T  4.41T  0  19.0K504   149M
pond5.79T  4.41T  0104  0  6.96M

Weird indeed.

It wrote 10Gb (according to "zfs get usedbydataset pond/test") taking roughly 
30 
minutes after which I killed it.

Now, writing to an uncompressed filesystem dataset (although very far from 
what's trumpeted as Thumper performance) yields quite different numbers:

pond5.80T  4.40T  1  3.64K   1022   457M
pond5.80T  4.40T  0866967  75.7M
pond5.80T  4.40T  0  4.65K  0   586M
pond5.80T  4.40T  6802  33.4K  69.2M
pond5.80T  4.40T 29  2.44K  1.10M   301M
pond5.80T  4.40T 32691   735K  25.0M
pond5.80T  4.40T 56  1.59K  2.29M   184M
pond5.80T  4.40T150768  4.61M  10.5M
pond5.80T  4.40T  2  0  25.5K  0
pond5.80T  4.40T  0  2.75K  0   341M
pond5.80T  4.40T  7  3.96K   339K   497M
pond5.80T  4.39T 85740  3.57M  59.0M
pond5.80T  4.39T 67  0  2.22M  0
pond5.80T  4.39T  9  4.67K   292K   581M
pond5.80T  4.39T  4  1.07K   126K   137M
pond5.80T  4.39T 27333   338K  9.15M
pond5.80T  4.39T  5  0  28.0K  3.99K
pond5.82T  4.37T  1  5.42K  1.67K   677M
pond5.83T  4.37T  3  1.69K  8.36K   173M
pond5.83T  4.37T  2  0  5.49K  0
pond5.83T  4.37T  0  6.32K  0   790M
pond5.83T  4.37T  2290  7.95K  27.8M
pond5.83T  4.37T  0  9.64K  1.23K  1.18G

The numbers are jumpy (maybe due to fragmentation, other processes, etc.) but
there are often spikes in excess of 500MBps.

The whole test took a relatively little time:

# time dd if=/dev/zero of=/pond/tmpnocompress/test30g bs=65536 count=50
50+0 records in
50+0 records out

real1m27.657s
user0m0.302s
sys 0m46.976s

# du -hs /pond/tmpnocompress/test30g 
  30G   /pond/tmpnocompress/test30g

To detail about the pool: 

The pool is on a Sun X4500 with 48 250Gb SATA drives. It was created as a 9x5 
set (9 stripes made of 5-disk raidz1 vdevs) spread across different 
controllers, 
with the command:

# zpool create -f pond \
raidz1 c0t0d0 c1t0d0 c4t0d0 c6t0d0 c7t0d0 \
raidz1 c0t1d0 c1t2d0 c4t3d0 c6t5d0 c7t6d0 \
raidz1 c1t1d0 c4t1d0 c5t1d0 c6t1d0 c7t1d0 \
raidz1 c0t2d0 c4t2d0 c5t2d0 c6t2d0 c7t2d0 \
raidz1 c0t3d0 c1t3d0 c5t3d0 c6t3d0 c7t3d0 \
raidz1 c0t4d0 c1t4d0 c4t4d0 c6t4d0 c7t4d0 \
raidz1 c0t5d0 c1t5d0 c4t5d0 c5t5d0 c7t5d0 \
raidz1 c0t6d0 c1t6d0 c4t6d0 c5t6d0 c6t6d0 \
raidz1 c1t7d0 c4t7d0 c5t7d0 c6t7d0 c7t7d0 \
spare c0t7d0

Alas, while there were many blogs, I couldn't find a definitive answer last 
year as
to which Thumper layout is optimal in performance and/or reliability (in regard 
to 
6 controllers of 8 disks each, with 2 disks on one of the controllers reserved 
for 
booting). 

As a result, we spread each raidz1 across 5 controllers, so the loss of one
controller should, on average, have minimal impact in terms of data loss. Since the 
system layout is not symmetrical, some controllers are more important than 
others (say, the boot one).

//Jim
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Migrating 10TB of data from NTFS is there a simple way?

2009-07-09 Thread Lejun Zhu
> Ok so this is my solution, pls be advised I am a
> total linux nube so I am learning as I go along. I
> installed opensolaris and setup rpool as my base
> install on a single 1TB drive. I attached one of my
> NTFS drives to the system then used a utility called
> prtparts to get the name of the NTFS drive attached
> and then mounted it succesfully.
> I then started transfering data accross till the
> drive was empty (this is currently in progress) Once
> thats done I will add the empty NTFS drive to my ZFS
> pool and repeat the operation with my other drives.
> 
> This leaves me with the issue of redundancy which is
> sorely lacking, ideally I would like to do the same
> think directly into a zraid pool, but I understand
> from what I have read that you cant add single drives
> to a zraid and I want all my drives in a single pool
> as only to loose the space for the pool  redundancy
> once.
> 
> I havent worked out if I can transform my zpool int a
> zraid after I have copied all my data.
> 
> Once again thx for the great support. And maybe
> someone can direct me to an area in a forum that
> explains y I cant use sudo...

Hope this helps
http://forums.opensolaris.com/thread.jspa?threadID=583&tstart=-1
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss