Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...

2008-10-14 Thread Gray Carper
Hey there, Bob!

Looks like you and Akhilesh (thanks, Akhilesh!) are driving at a similar,
very valid point. I'm currently using the default recordsize (128K) on all
of the ZFS pools (those of the iSCSI target nodes and the aggregate pool on
the head node).

I should've mentioned something about how the storage will be used in my
original post, so I'm glad you brought it up. It will all be presented over
NFS and CIFS as a 10GbE + InfiniBand NAS, which will serve a number of
organizations. Some organizations will simply use their area for end-user
file sharing, others will use it as a disk backup target, others for
databases, and still others for HPC data crunching (gene sequences). Each of
these uses will be on different filesystems, of course, so I expect it would
be good to set different recordsize parameters for each one. Do you have any
suggestions on good starting sizes for each? I'd imagine filesharing might
benefit from a relatively small record size (64K?), image-based backup
targets might like a pretty large record size (256K?), databases just need
recordsizes to match their block sizes, and HPC...I have no idea. Heh. I
expect I'll need to get in contact with the HPC lab to see what kind of
profile they have (whether they deal with tiny files or big files, etc).
What do you think?
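
For concreteness, I'm imagining per-filesystem settings along these lines (the
dataset names are just placeholders, recordsize tops out at 128K on our builds,
and it only affects newly written files):

  zfs set recordsize=64K tank/fileshare   # end-user file sharing (a guess)
  zfs set recordsize=128K tank/backup     # large image-based backup streams
  zfs set recordsize=8K tank/dbdata       # matched to the database block size
  zfs get recordsize tank/dbdata          # verify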

Today I'm going to try a few non-ZFS-related tweaks (disabling the Nagle
algorithm on the iSCSI initiator and increasing MTU everywhere to 9000).
I'll give those a shot and see if they yield performance enhancements.
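
(Roughly, and only as a sketch - the interface name below is just an example,
and some drivers also want a jumbo-frame setting in their driver.conf before
the larger MTU will take:)

  ndd -set /dev/tcp tcp_naglim_def 1   # effectively disables Nagle system-wide
  ndd -get /dev/tcp tcp_naglim_def     # verify
  ifconfig nxge0 mtu 9000              # jumbo frames on the 10GbE interface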

-Gray

On Tue, Oct 14, 2008 at 10:36 PM, Bob Friesenhahn <
[EMAIL PROTECTED]> wrote:

> On Tue, 14 Oct 2008, Gray Carper wrote:
>
>>
>> So, how concerned should we be about the low scores here and there? Any
>> suggestions on how to improve our configuration? And how excited should we
>> be about the 8GB tests? ;>
>>
>
> The level of concern should depend on how you expect your storage pool to
> actually be used.  It seems that it should work great for bulk storage, but
> not to support a database, or ultra high-performance super-computing
> applications.  The good 8GB performance is due to successful ZFS ARC caching
> in RAM, and because the record size is reasonable given the ZFS block size
> and the buffering ability of the intermediate links.  You might see somewhat
> better performance using a 256K record size.
>
> It may take quite a while to fill 150TB up.
>
> Bob
> ==
> Bob Friesenhahn
> [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
> GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
>
>


-- 
Gray Carper
MSIS Technical Services
University of Michigan Medical School
[EMAIL PROTECTED]  |  skype:  graycarper  |  734.418.8506
http://www.umms.med.umich.edu/msis/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] HELP! SNV_97,98,99 zfs with iscsitadm and VMWare!

2008-10-14 Thread Tano
I'm not sure if this is a problem with the iSCSI target or ZFS, so I'd greatly 
appreciate it if this gets moved to the proper list.

Well, I'm just about out of ideas about what might be wrong.

Quick history:

I installed OS 2008.05 when it was SNV_86 to try out ZFS with VMware. I found out 
that multiple LUNs were being treated as multipaths, so I waited until SNV_94 came 
out to fix the issues with VMware and iscsitadm/zfs shareiscsi=on.

I installed OS 2008.05 on a virtual machine as a test bed, ran pkg image-update to 
SNV_94 a month ago, made some thin-provisioned partitions, shared them with 
iscsitadm, and mounted them on VMware without any problems. I ran Storage VMotion 
and all went well.

So, with this success, I purchased a Dell 1900 with a PERC 5/i controller and 6 x 15K 
SAS drives in a ZFS RAIDZ1 configuration. I shared the ZFS partitions and 
mounted them on VMware. Everything is great until I have to write to the disks.

It won't write!


Steps I took creating the disks

1) Installed mega_sas drivers.
2) zpool create tank raidz c5t0d0 c5t1d0 c5t2d0 c5t3d0 c5t4d0 c5t5d0
3) zfs create -V 1TB tank/disk1
4) zfs create -V 1TB tank/disk2
5) iscsitadm create target -b /dev/zvol/rdsk/tank/disk1 LABEL1
6) iscsitadm create target -b /dev/zvol/rdsk/tank/disk2 LABEL2

Now both drives are LUN 0, but with unique VMHBA device identifiers, so they are 
detected as separate drives.

I then redid (deleted) steps 5 and 6 and changed them to:

5) iscsitadm create target -u 0 -b /dev/zvol/rdsk/tank/disk1 LABEL1
6) iscsitadm create target -u 1 -b /dev/zvol/rdsk/tank/disk2 LABEL1

VMware discovers the separate LUNs by device identifier, but I am still unable 
to write to the iSCSI LUNs.

Why is it that the steps I've conducted work in SNV_94 but not in SNV_97, 98, 
or 99?

Any ideas? Any log files I can check? I am still an ignorant Linux user, so I 
only know to look in /var/log. :)
--
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool CKSUM errors since drive replace

2008-10-14 Thread Mark J Musante
> So this is where I stand.  I'd like to ask zfs-discuss if they've seen any 
> ZIL/Replay style bugs associated with u3/u5 x86?  Again, I'm confident in my 
> hardware, and /var/adm/messages is showing no warnings/errors.

Are you absolutely sure the hardware is OK?  Is there another disk you can test 
in its place?  If I read your post correctly, your first disk was having errors 
logged against it, and now the second disk -- plugged into the same port -- is 
also logging errors.

This seems to me more like the port is bad.  Is there a third disk you can try 
in that same port?

I have a hard time seeing that this could be a zfs bug - I've been doing lots 
of testing on u5 and the only time I see checksum errors is when I deliberately 
induce them.
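
A few commands that can help tell a flaky disk from a flaky port or cable (the
pool name is a placeholder):

  zpool status -v tank   # per-vdev READ/WRITE/CKSUM counters
  iostat -En             # per-device hard/soft/transport error totals
  fmdump -eV | more      # recent FMA ereports; transport/driver errors show up here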
--
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...

2008-10-14 Thread James C. McPherson
Erast Benson wrote:
> James, all serious ZFS bug fixes back-ported to b85 as well as marvell
> and other sata drivers. Not everything is possible to back-port of
> course, but I would say all critical things are there. This includes ZFS
> ARC optimization patches, for example.

Excellent!


James
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp   http://www.jmcp.homeunix.com/blog
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Segmentation fault / core dump with recursive

2008-10-14 Thread BJ Quinn
Well, I haven't solved everything yet, but I do feel better now that I realize 
that it was setting mountpoint=none that caused the zfs send/recv to hang.  
Allowing the default mountpoint setting fixed that problem.  I'm now trying 
mountpoint=legacy, because I'd really rather leave it unmounted, especially 
during the backup itself, to prevent changes from happening while the incrementals 
are copying over, and also, in the end, to hopefully let me avoid using -F.

The incrementals (copying all the snapshots beyond the first one copied) are 
really slow, however.  Is there anything that can be done to speed that up?  
I'm using compression (gzip-1) on the source filesystem.  I wanted the backup 
to retain the same compression.  Can ZFS copy the compressed version over to 
the backup, or does it really have to uncompress it and recompress it?  That 
takes time and lots of CPU cycles.  I'm dealing with highly compressible data 
(at least 6.5:1).
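
For reference, the pipeline I'm running looks roughly like this (dataset and
snapshot names are placeholders), with gzip-1 set on the destination so the
received data at least gets recompressed there:

  zfs set compression=gzip-1 backup/data
  zfs snapshot tank/data@2008-10-14
  zfs send -i tank/data@2008-10-13 tank/data@2008-10-14 | zfs receive backup/data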
--
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] L2ARC on iSER ramdisks?

2008-10-14 Thread Joerg Moellenkamp

Hello

the idea introduced by Chris Greer to use servers as solid state disks
has kept my brain busy for the last few days. Perhaps it makes sense to put L2ARC
devices in memory as well, to increase the in-memory part of a database
beyond the capacity of a single server. I already wrote about
the idea in my blog at http://www.c0t0d0s0.org/archives/4919-L2ARC-on-ramdisks.html
so I will just copy-and-paste the article here:


"I thought a little bit about the idea of transforming server into  
solid state disks. The idea in the mail of Chris Greer on zfs-discuss  
was to use mirrored iSCSI shared ramdisks as a storage for the  
seperated ZILs. But i think you could use the concept as well for  
L2ARC as well - e.g. for large databases. One of the sizing rules of  
databases: More main memory never hurts. Nothing helps the performance  
of a database more than even more memory. The rule of "main memory  
never hurts" is based on the fact, that a hard disk has only a few  
IOPS compared with the main memory and hard drive access massively  
hurts the performance of your database.


But obviously the size of memory is limited, albeit this limit is
quite high, with systems offering memory in the range of 512 GB in 4
rack units. But how can you get more memory into your database system
when all DIMM slots are already filled with the biggest available DIMMs?


I had an idea while cooking tea this evening, thinking about a
discussion with a colleague: let's assume an architecture based on an
X4600 as a head node in front of four X4600s, each fully maxed out at 512 GB. All
the nodes are connected with InfiniBand. The first X4600 is your
normal database server (for example MySQL or LarryBase). You put your
data into a ZFS storage pool. This storage pool is augmented with
L2ARC devices. But now comes the plot twist. Let's use the 512 GB X4600s
as huge ramdisks (yes, I know, every engineer's heart will cry now),
speaking via iSER (no TCP/IP, just RDMA) at 20 Gbit/s to the central
database node. This would give you a cache of almost 2 TB,
plus the cache on the database server itself. By using L2ARC you
could use the memory of other systems as database cache without
having the database combine the memory resources by other means,
for example the Cache Fusion feature of Oracle. You don't have to fuse
the caches of other database servers; the other servers are the caches.
You don't have to partition the databases.


It would be interesting to see how such a system would perform in
comparison to an Oracle RAC or other in-memory implementations. Anybody
out there willing to test this? My InfiniBand switches are in the
laundry at the moment ;-)"


I hope this idea is not complete nonsense ...
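
In ZFS terms, the plumbing would look roughly like this (sizes, device names,
and the iSER/RDMA transport details are all hand-waved; this is only a sketch):

  # on each 512GB memory node:
  ramdiskadm -a l2arc0 400g
  iscsitadm create target -b /dev/ramdisk/l2arc0 l2arc0

  # on the database head node, once the initiator sees the LUN:
  zpool add dbpool cache c3t0d0   # whatever device name the LUN shows up as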

Regards
 Joerg

--
Joerg Moellenkamp Tel: (+49 40) 25 15 23 - 460
Senior Systems Engineer   Fax: (+49 40) 25 15 23 - 425
Sun Microsystems GmbH   Mobile: (+49 172) 83 18 433
Nagelsweg 55 mailto:[EMAIL PROTECTED]
D-20097 Hamburg   http://www.sun.de

Sitz der Gesellschaft:   Sun Microsystems GmbH
 Sonnenallee 1
 D-85551 Kirchheim-Heimstetten
Amtsgericht München: HRB 161028
Geschäftsführer:Thomas Schröder
  Wolfgang Engels
  Dr. Roland Bömer
Vorsitzender des Aufsichtsrates: Martin Häring

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...

2008-10-14 Thread Brent Jones
On Tue, Oct 14, 2008 at 12:31 AM, Gray Carper <[EMAIL PROTECTED]> wrote:
> Hey, all!
>
> We've recently used six x4500 Thumpers, all publishing ~28TB iSCSI targets 
> over ip-multipathed 10GB ethernet, to build a ~150TB ZFS pool on an x4200 
> head node. In trying to discover optimal ZFS pool construction settings, 
> we've run a number of iozone tests, so I thought I'd share them with you and 
> see if you have any comments, suggestions, etc.
>
> First, on a single Thumper, we ran baseline tests on the direct-attached 
> storage (which is collected into a single ZFS pool comprised of four raidz2 
> groups)...
>
> [1GB file size, 1KB record size]
> Command: iozone -i o -i 1 -i 2 -r 1k -s 1g -f /data-das/perftest/1gbtest
> Write: 123919
> Rewrite: 146277
> Read: 383226
> Reread: 383567
> Random Read: 84369
> Random Write: 121617
>
> [8GB file size, 512KB record size]
> Command:
> Write:  373345
> Rewrite:  665847
> Read:  2261103
> Reread:  2175696
> Random Read:  2239877
> Random Write:  666769
>
> [64GB file size, 1MB record size]
> Command: iozone -i o -i 1 -i 2 -r 1m -s 64g -f /data-das/perftest/64gbtest
> Write: 517092
> Rewrite: 541768
> Read: 682713
> Reread: 697875
> Random Read: 89362
> Random Write: 488944
>
> These results look very nice, though you'll notice that the random read 
> numbers tend to be pretty low on the 1GB and 64GB tests (relative to their 
> sequential counterparts), but the 8GB random (and sequential) read is 
> unbelievably good.
>
> Now we move to the head node's iSCSI aggregate ZFS pool...
>
> [1GB file size, 1KB record size]
> Command: iozone -i o -i 1 -i 2 -r 1k -s 1g -f 
> /volumes/data-iscsi/perftest/1gbtest
> Write:  127108
> Rewrite:  120704
> Read:  394073
> Reread:  396607
> Random Read:  63820
> Random Write:  5907
>
> [8GB file size, 512KB record size]
> Command: iozone -i 0 -i 1 -i 2 -r 512 -s 8g -f 
> /volumes/data-iscsi/perftest/8gbtest
> Write:  235348
> Rewrite:  179740
> Read:  577315
> Reread:  662253
> Random Read:  249853
> Random Write:  274589
>
> [64GB file size, 1MB record size]
> Command: iozone -i o -i 1 -i 2 -r 1m -s 64g -f 
> /volumes/data-iscsi/perftest/64gbtest
> Write:  190535
> Rewrite:  194738
> Read:  297605
> Reread:  314829
> Random Read:  93102
> Random Write:  175688
>
> Generally speaking, the results look good, but you'll notice that random 
> writes are atrocious on the 1GB tests and random reads are not so great on 
> the 1GB and 64GB tests, but the 8GB test looks great across the board. 
> Voodoo! ;> Incidentally, I ran all these tests against the ZFS pool in disk, 
> raidz1, and raidz2 modes - there were no significant changes in the results.
>
> So, how concerned should we be about the low scores here and there? Any 
> suggestions on how to improve our configuration? And how excited should we be 
> about the 8GB tests? ;>
>
> Thanks so much for any input you have!
> -Gray
> ---
> University of Michigan
> Medical School Information Services
> --
> This message posted from opensolaris.org
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>

Your setup sounds very interesting. How do you export iSCSI to another
head unit? Can you give me some more details on your filesystem
layout and how you mount it on the head unit?
Sounds like a pretty clever way to export awesomely large volumes!

Regards,

-- 
Brent Jones
[EMAIL PROTECTED]
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Change the volblocksize of a ZFS volume

2008-10-14 Thread Richard Elling
Nick Smith wrote:
> Dear all,
>
> Background:
>
> I have a ZFS volume with the incorrect volume blocksize for the filesystem 
> (NTFS) that it is supporting. 
>
> This volume contains important data that is proving impossible to copy using 
> Windows XP Xen HVM that "owns" the data.
>
> The disparity in volume blocksize (current set to 512bytes!!) is causing 
> significant performance problems.
>
> Question :
>
> Is there a way to change the volume blocksize say via 'zfs snapshot 
> send/receive'?
>
> As I see things, this isn't possible as the target volume (including property 
> values) gets overwritten by 'zfs receive'.
>   

By default, properties are not received.  To pass properties, you need 
to use
the -R flag.  For examples, see the ZFS Administration Guide,
http://www.opensolaris.org/os/community/zfs/docs/zfsadmin.pdf
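
A minimal sketch (pool and volume names are placeholders):

  zfs snapshot tank/ntfsvol@migrate
  zfs send -R tank/ntfsvol@migrate | zfs receive -d otherpool

Note that, as far as I know, volblocksize is fixed when a volume is created, so
a received copy keeps the original block size; -R carries the other properties
and snapshots along with the data.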
 -- richard

> Many Thanks for any help.
>
> Nick Smith
> --
> This message posted from opensolaris.org
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>   

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...

2008-10-14 Thread Erast Benson
James, all serious ZFS bug fixes have been back-ported to b85, as well as the marvell
and other SATA drivers. Not everything is possible to back-port, of
course, but I would say all the critical things are there. This includes the ZFS
ARC optimization patches, for example.

On Tue, 2008-10-14 at 22:33 +1000, James C. McPherson wrote:
> Gray Carper wrote:
> > Hey there, James!
> > 
> > We're actually running NexentaStor v1.0.8, which is based on b85. We 
> > haven't done any tuning ourselves, but I suppose it is possible that 
> > Nexenta did. If there's something specific you'd like me to look for, 
> > I'd be happy to.
> 
> Hi Gray,
> So build 85 that's getting a bit long in the tooth now.
> 
> I know there have been *lots* of ZFS, Marvell SATA and iSCSI
> fixes and enhancements since then which went into OpenSolaris.
> I know they're in Solaris Express and the updated binary distro
> form of os2008.05 - I just don't know whether Erast and the
> Nexenta clan have included them in what they are releasing as 1.0.8.
> 
> Erast - could you chime in here please? Unfortunately I've got no
> idea about Nexenta.
> 
> 
> James C. McPherson
> --
> Senior Kernel Software Engineer, Solaris
> Sun Microsystems
> http://blogs.sun.com/jmcp http://www.jmcp.homeunix.com/blog
> 

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...

2008-10-14 Thread Bob Friesenhahn
On Tue, 14 Oct 2008, Gray Carper wrote:
>
> So, how concerned should we be about the low scores here and there? 
> Any suggestions on how to improve our configuration? And how excited 
> should we be about the 8GB tests? ;>

The level of concern should depend on how you expect your storage pool 
to actually be used.  It seems that it should work great for bulk 
storage, but not to support a database, or ultra high-performance 
super-computing applications.  The good 8GB performance is due to 
successful ZFS ARC caching in RAM, and because the record size is 
reasonable given the ZFS block size and the buffering ability of the 
intermediate links.  You might see somewhat better performance using a 
256K record size.
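
For instance (taking this to mean the iozone record size, since ZFS's own
recordsize tops out at 128K), the 8GB run could be repeated as:

  iozone -i 0 -i 1 -i 2 -r 256k -s 8g -f /volumes/data-iscsi/perftest/8gbtest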

It may take quite a while to fill 150TB up.

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Change the volblocksize of a ZFS volume

2008-10-14 Thread Nick Smith
Dear all,

Background:

I have a ZFS volume with the incorrect volume blocksize for the filesystem 
(NTFS) that it is supporting. 

This volume contains important data that is proving impossible to copy using the 
Windows XP Xen HVM that "owns" the data.

The disparity in volume blocksize (currently set to 512 bytes!) is causing 
significant performance problems.

Question :

Is there a way to change the volume blocksize say via 'zfs snapshot 
send/receive'?

As I see things, this isn't possible as the target volume (including property 
values) gets overwritten by 'zfs receive'.

Many Thanks for any help.

Nick Smith
--
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...

2008-10-14 Thread Akhilesh Mritunjai
Just a random spectator here, but I think the artifacts you're seeing are not 
due to file size, but rather due to record size.

What is the ZFS record size?

On a personal note, I wouldn't do non-concurrent (?) benchmarks. They are at 
best useless and at worst misleading for ZFS.
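
(For reference, a concurrent iozone run looks something like the following -
the thread count and file paths are only examples:)

  iozone -t 4 -i 0 -i 1 -i 2 -r 128k -s 1g -F /volumes/data-iscsi/perftest/t1 /volumes/data-iscsi/perftest/t2 /volumes/data-iscsi/perftest/t3 /volumes/data-iscsi/perftest/t4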

- Akhilesh.
--
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...

2008-10-14 Thread Gray Carper
Howdy!

Sounds good. We'll upgrade to 1.1 (b101) as soon as it is released, re-run
our battery of tests, and see where we stand.

Thanks!
-Gray

On Tue, Oct 14, 2008 at 8:47 PM, James C. McPherson <[EMAIL PROTECTED]
> wrote:

> Gray Carper wrote:
>
>> Hello again! (And hellos to Erast, who has been a huge help to me many,
>> many times! :>)
>>
>> As I understand it, Nexenta 1.1 should be released in a matter of weeks
>> and it'll be based on build 101. We are waiting for that with baited breath,
>> since it includes some very important Active Directory integration fixes,
>> but this sounds like another reason to be excited about it. Maybe this is a
>> discussion that should be tabled until we are able to upgrade?
>>
>
> Yup, I think that's probably the best thing. And thanks
> for passing on the info about the 1.1 release, I'll keep
> that in my back pocket :)
>
>
> cheers,
> James
>
> --
> Senior Kernel Software Engineer, Solaris
> Sun Microsystems
> http://blogs.sun.com/jmcp   http://www.jmcp.homeunix.com/blog
>



-- 
Gray Carper
MSIS Technical Services
University of Michigan Medical School
[EMAIL PROTECTED]  |  skype:  graycarper  |  734.418.8506
http://www.umms.med.umich.edu/msis/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...

2008-10-14 Thread James C. McPherson
Gray Carper wrote:
> Hello again! (And hellos to Erast, who has been a huge help to me many, 
> many times! :>)
> 
> As I understand it, Nexenta 1.1 should be released in a matter of weeks 
> and it'll be based on build 101. We are waiting for that with baited 
> breath, since it includes some very important Active Directory 
> integration fixes, but this sounds like another reason to be excited 
> about it. Maybe this is a discussion that should be tabled until we are 
> able to upgrade?

Yup, I think that's probably the best thing. And thanks
for passing on the info about the 1.1 release, I'll keep
that in my back pocket :)


cheers,
James
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp   http://www.jmcp.homeunix.com/blog
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...

2008-10-14 Thread Gray Carper
Hello again! (And hellos to Erast, who has been a huge help to me many, many
times! :>)

As I understand it, Nexenta 1.1 should be released in a matter of weeks and
it'll be based on build 101. We are waiting for that with bated breath,
since it includes some very important Active Directory integration fixes,
but this sounds like another reason to be excited about it. Maybe this is a
discussion that should be tabled until we are able to upgrade?

-Gray

On Tue, Oct 14, 2008 at 8:33 PM, James C. McPherson <[EMAIL PROTECTED]
> wrote:

> Gray Carper wrote:
>
>> Hey there, James!
>>
>> We're actually running NexentaStor v1.0.8, which is based on b85. We
>> haven't done any tuning ourselves, but I suppose it is possible that Nexenta
>> did. If there's something specific you'd like me to look for, I'd be happy
>> to.
>>
>
> Hi Gray,
> So build 85 that's getting a bit long in the tooth now.
>
> I know there have been *lots* of ZFS, Marvell SATA and iSCSI
> fixes and enhancements since then which went into OpenSolaris.
> I know they're in Solaris Express and the updated binary distro
> form of os2008.05 - I just don't know whether Erast and the
> Nexenta clan have included them in what they are releasing as 1.0.8.
>
> Erast - could you chime in here please? Unfortunately I've got no
> idea about Nexenta.
>
>
>
> James C. McPherson
> --
> Senior Kernel Software Engineer, Solaris
> Sun Microsystems
> http://blogs.sun.com/jmcp   http://www.jmcp.homeunix.com/blog
>



-- 
Gray Carper
MSIS Technical Services
University of Michigan Medical School
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...

2008-10-14 Thread James C. McPherson
Gray Carper wrote:
> Hey there, James!
> 
> We're actually running NexentaStor v1.0.8, which is based on b85. We 
> haven't done any tuning ourselves, but I suppose it is possible that 
> Nexenta did. If there's something specific you'd like me to look for, 
> I'd be happy to.

Hi Gray,
So build 85 that's getting a bit long in the tooth now.

I know there have been *lots* of ZFS, Marvell SATA and iSCSI
fixes and enhancements since then which went into OpenSolaris.
I know they're in Solaris Express and the updated binary distro
form of os2008.05 - I just don't know whether Erast and the
Nexenta clan have included them in what they are releasing as 1.0.8.

Erast - could you chime in here please? Unfortunately I've got no
idea about Nexenta.


James C. McPherson
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp   http://www.jmcp.homeunix.com/blog
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...

2008-10-14 Thread Gray Carper
Hey there, James!

We're actually running NexentaStor v1.0.8, which is based on b85. We haven't
done any tuning ourselves, but I suppose it is possible that Nexenta did. If
there's something specific you have in mind, I'd be happy to look for it.

Thanks!
-Gray

On Tue, Oct 14, 2008 at 8:10 PM, James C. McPherson <[EMAIL PROTECTED]
> wrote:

> Gray Carper wrote:
>
>> Hey, all!
>>
>> We've recently used six x4500 Thumpers, all publishing ~28TB iSCSI
>> targets over ip-multipathed 10GB ethernet, to build a ~150TB ZFS pool on
>> an x4200 head node. In trying to discover optimal ZFS pool construction
>> settings, we've run a number of iozone tests, so I thought I'd share them
>> with you and see if you have any comments, suggestions, etc.
>>
>
> [snip]
>
>
> Which build are you running? Have you done any system
> or ZFS tuning?
>
>
> James C. McPherson
> --
> Senior Kernel Software Engineer, Solaris
> Sun Microsystems
> http://blogs.sun.com/jmcp   http://www.jmcp.homeunix.com/blog
>



-- 
Gray Carper
MSIS Technical Services
University of Michigan Medical School
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...

2008-10-14 Thread James C. McPherson
Gray Carper wrote:
> Hey, all!
> 
> We've recently used six x4500 Thumpers, all publishing ~28TB iSCSI
> targets over ip-multipathed 10GB ethernet, to build a ~150TB ZFS pool on
> an x4200 head node. In trying to discover optimal ZFS pool construction
> settings, we've run a number of iozone tests, so I thought I'd share them
> with you and see if you have any comments, suggestions, etc.

[snip]


Which build are you running? Have you done any system
or ZFS tuning?


James C. McPherson
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp   http://www.jmcp.homeunix.com/blog
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS confused about disk controller

2008-10-14 Thread Caryl Takvorian
For the sake of completeness, in the end I simply created links in 
/dev/rdsk for c1t0d0sX to point to my disk and was able to reactivate 
the current BE.
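
Roughly like this - the c1 names are the ones ZFS and bootadm were asking for,
and c5 is where the disk really lives:

  cd /dev/rdsk
  ln -s c5t0d0s0 c1t0d0s0
  ln -s c5t0d0s2 c1t0d0s2
  (...and likewise for any other slices that exist)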

The shroud of mystery hasn't lifted though because when I did eventually 
reboot, I performed a reconfigure (boot -r) and the format and cfgadm 
commands now report my disk to be attached to the c1 controller.

I don't know how or why Solaris can go from detecting c1 to c5 and back 
to c1 again, but at least now ZFS and format agree on where my disk is.

Maybe the last time I did a reconfigure was when I booted from the LiveCD?


Caryl



On 10/07/08 11:58, Caryl Takvorian wrote:
> Hi all,
>
> Please keep me on cc: since I am not subscribed to either lists.
>
>
> I have a weird problem with my OpenSolaris 2008.05 installation (build 
> 96) on my Ultra 20 workstation.
> For some reason, ZFS has been confused and has recently starting to 
> believe that my zpool is using a device which  does not exist !
>
> prodigal:zfs #zpool status
>  pool: rpool
> state: ONLINE
> scrub: none requested
> config:
>
>NAMESTATE READ WRITE CKSUM
>rpool   ONLINE   0 0 0
>  c1t0d0s0  ONLINE   0 0 0
>
> errors: No known data errors
>
>
> The c1t0d0s0 device doesn't exist on my system. Instead, my disk is 
> attached to c5t0d0s0 as shown by
>
> prodigal:zfs #format
> Searching for disks...done
>
>
> AVAILABLE DISK SELECTIONS:
>   0. c5t0d0 
>  /[EMAIL PROTECTED],0/pci108e,[EMAIL PROTECTED]/[EMAIL PROTECTED],0
>
> or
>
> prodigal:zfs #cfgadm
> Ap_Id                          Type         Receptacle   Occupant     Condition
> sata0/0::dsk/c5t0d0            disk         connected    configured   ok
>
>
>
> What is really annoying is that I attempted to update my current 
> OpenSolaris build 96 to the latest (b98)  by using
>
> # pkg image-update
>
> The update went well, and at the end it selected the new BE to be 
> activated upon reboot, but failed when attempting to modify the grub 
> entry, because install_grub asks ZFS what my boot device is and gets 
> back the wrong device (of course, I am using ZFS as my root 
> filesystem, otherwise it wouldn't be fun).
>
> When I manually try to run install_grub, this is the error message I get:
>
> prodigal:zfs #/tmp/tmpkkEF1W/boot/solaris/bin/update_grub -R 
> /tmp/tmpkkEF1W
> Creating GRUB menu in /tmp/tmpkkEF1W
> bootadm: fstyp -a on device /dev/rdsk/c1t0d0s0 failed
> bootadm: failed to get pool for device: /dev/rdsk/c1t0d0s0
> bootadm: fstyp -a on device /dev/rdsk/c1t0d0s0 failed
> bootadm: failed to get pool name from /dev/rdsk/c1t0d0s0
> bootadm: failed to create GRUB boot signature for device: 
> /dev/rdsk/c1t0d0s0
> bootadm: failed to get grubsign for root: /tmp/tmpkkEF1W, device 
> /dev/rdsk/c1t0d0s0
> Installing grub on /dev/rdsk/c1t0d0s0
> cannot open/stat device /dev/rdsk/c1t0d0s2
>
>
> The worst bit, is that now beadm refuses to reactivate my current 
> running OS to be used upon the next reboot.
> So, the next time I reboot, my system is probably never going to come 
> back up.
>
>
> prodigal:zfs #beadm list
>
> BE            Active Mountpoint     Space   Policy Created
> ------------- ------ -------------- ------- ------ ----------------
> opensolaris-5 N      /              128.50M static 2008-09-09 13:03
> opensolaris-6 R      /tmp/tmpkkEF1W 52.19G  static 2008-10-07 10:14
>
>
> prodigal:zfs #export BE_PRINT_ERR=true
> prodigal:zfs #beadm activate opensolaris-5
> be_do_installgrub: installgrub failed for device c1t0d0s0.
> beadm: Unable to activate opensolaris-5
>
>
> So, how can I force zpool to accept that my disk device really is on 
> c5t0d0s0 and forget about c1?
>
> Since the file /etc/zfs/zpool.cache contains a reference to 
> /dev/dsk/c1t0d0s0  I have rebuilt the boot_archive after removing it 
> from the ramdisk, but I've got cold feet about rebooting without 
> confirmation.
>
>
> Has anyone seen this before or has any idea how to fix this situation ?
>
>
> Thanks
>
>
> Caryl
>
>

-- 
~~~
Caryl Takvorian [EMAIL PROTECTED]
ISV Engineering phone : +44 (0)1252 420 686
Sun Microsystems UK mobile: +44 (0)771 778 5646



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Improving zfs send performance

2008-10-14 Thread Thomas Maier-Komor
Carsten Aulbert wrote:
> Hi again,
> 
> Thomas Maier-Komor wrote:
>> Carsten Aulbert schrieb:
>>> Hi Thomas,
>> I don't know socat or what benefit it gives you, but have you tried
>> using mbuffer to send and receive directly (options -I and -O)?
> 
> I thought we tried that in the past and with socat it seemed faster, but
> I just made a brief test and I got (/dev/zero -> remote /dev/null) 330
> MB/s with mbuffer+socat and 430MB/s with mbuffer alone.
> 
>> Additionally, try to set the block size of mbuffer to the recordsize of
>> zfs (usually 128k):
>> receiver$ mbuffer -I sender:1 -s 128k -m 2048M | zfs receive
>> sender$ zfs send blabla | mbuffer -s 128k -m 2048M -O receiver:1
> 
> We are using 32k since many of our user use tiny files (and then I need
> to reduce the buffer size because of this 'funny' error):
> 
> mbuffer: fatal: Cannot address so much memory
> (32768*65536=2147483648>1544040742911).
> 
> Does this qualify for a bug report?
> 
> Thanks for the hint of looking into this again!
> 
> Cheers
> 
> Carsten
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

Yes this qualifies for a bug report. As a workaround for now, you can
compile in 64 bit mode.
I.e.:
$ ./configure CFLAGS="-g -O -m64"
$ make && make install

This works for Sun Studio 12 and gcc. For older versions of Sun Studio,
you need to pass -xarch=v9 instead of -m64.
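For an older Sun Studio that would be, for example:
$ ./configure CFLAGS="-g -O -xarch=v9"
$ make && make install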

I am planning to release an updated version of mbuffer this week. I'll
include a patch for this issue.

Cheers,
Thomas
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Improving zfs send performance

2008-10-14 Thread Carsten Aulbert
Hi again,

Thomas Maier-Komor wrote:
> Carsten Aulbert schrieb:
>> Hi Thomas,
> I don't know socat or what benefit it gives you, but have you tried
> using mbuffer to send and receive directly (options -I and -O)?

I thought we tried that in the past and with socat it seemed faster, but
I just made a brief test and I got (/dev/zero -> remote /dev/null) 330
MB/s with mbuffer+socat and 430MB/s with mbuffer alone.

> Additionally, try to set the block size of mbuffer to the recordsize of
> zfs (usually 128k):
> receiver$ mbuffer -I sender:1 -s 128k -m 2048M | zfs receive
> sender$ zfs send blabla | mbuffer -s 128k -m 2048M -O receiver:1

We are using 32k, since many of our users use tiny files (and then I need
to reduce the buffer size because of this 'funny' error):

mbuffer: fatal: Cannot address so much memory
(32768*65536=2147483648>1544040742911).

Does this qualify for a bug report?

Thanks for the hint of looking into this again!

Cheers

Carsten
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...

2008-10-14 Thread Gray Carper
Hey, all!

We've recently used six x4500 Thumpers, all publishing ~28TB iSCSI targets over 
IP-multipathed 10Gb Ethernet, to build a ~150TB ZFS pool on an x4200 head node. 
In trying to discover optimal ZFS pool construction settings, we've run a 
number of iozone tests, so I thought I'd share them with you and see if you 
have any comments, suggestions, etc.

First, on a single Thumper, we ran baseline tests on the direct-attached 
storage (which is collected into a single ZFS pool comprised of four raidz2 
groups)...

[1GB file size, 1KB record size]
Command: iozone -i 0 -i 1 -i 2 -r 1k -s 1g -f /data-das/perftest/1gbtest 
Write: 123919 
Rewrite: 146277 
Read: 383226 
Reread: 383567 
Random Read: 84369 
Random Write: 121617 

[8GB file size, 512KB record size]
Command:  
Write:  373345
Rewrite:  665847
Read:  2261103
Reread:  2175696
Random Read:  2239877
Random Write:  666769

[64GB file size, 1MB record size]
Command: iozone -i 0 -i 1 -i 2 -r 1m -s 64g -f /data-das/perftest/64gbtest 
Write: 517092
Rewrite: 541768 
Read: 682713
Reread: 697875
Random Read: 89362
Random Write: 488944

These results look very nice, though you'll notice that the random read numbers 
tend to be pretty low on the 1GB and 64GB tests (relative to their sequential 
counterparts), but the 8GB random (and sequential) read is unbelievably good.

Now we move to the head node's iSCSI aggregate ZFS pool...

[1GB file size, 1KB record size]
Command: iozone -i 0 -i 1 -i 2 -r 1k -s 1g -f 
/volumes/data-iscsi/perftest/1gbtest
Write:  127108
Rewrite:  120704
Read:  394073
Reread:  396607
Random Read:  63820
Random Write:  5907

[8GB file size, 512KB record size]
Command: iozone -i 0 -i 1 -i 2 -r 512 -s 8g -f 
/volumes/data-iscsi/perftest/8gbtest
Write:  235348
Rewrite:  179740
Read:  577315
Reread:  662253
Random Read:  249853
Random Write:  274589

[64GB file size, 1MB record size]
Command: iozone -i 0 -i 1 -i 2 -r 1m -s 64g -f 
/volumes/data-iscsi/perftest/64gbtest 
Write:  190535
Rewrite:  194738
Read:  297605
Reread:  314829
Random Read:  93102
Random Write:  175688

Generally speaking, the results look good, but you'll notice that random writes 
are atrocious on the 1GB tests and random reads are not so great on the 1GB and 
64GB tests, but the 8GB test looks great across the board. Voodoo! ;> 
Incidentally, I ran all these tests against the ZFS pool in disk, raidz1, and 
raidz2 modes - there were no significant changes in the results.

So, how concerned should we be about the low scores here and there? Any 
suggestions on how to improve our configuration? And how excited should we be 
about the 8GB tests? ;> 

Thanks so much for any input you have!
-Gray
---
University of Michigan
Medical School Information Services
--
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Improving zfs send performance

2008-10-14 Thread Thomas Maier-Komor
Carsten Aulbert wrote:
> Hi Thomas,
> 
> Thomas Maier-Komor wrote:
> 
>> Carsten,
>>
>> the summary looks like you are using mbuffer. Can you elaborate on what
>> options you are passing to mbuffer? Maybe changing the blocksize to be
>> consistent with the recordsize of the zpool could improve performance.
>> Is the buffer running full or is it empty most of the time? Are you sure
>> that the network connection is 10Gb/s all the way through from machine
>> to machine?
> 
> Well spotted :)
> 
> right now plain mbuffer with plenty of buffer (-m 2048M) on both ends
> and I have not seen any buffer exceeding the 10% watermark level. The
> network connection are via Neterion XFrame II Sun Fire NICs then via CX4
> cables to our core switch where both boxes are directly connected
> (WovenSystmes EFX1000). netperf tells me that the TCP performance is
> close to 7.5 GBit/s duplex and if I use
> 
> cat /dev/zero | mbuffer | socat ---> socat | mbuffer > /dev/null
> 
> I easily see speeds of about 350-400 MB/s so I think the network is fine.
> 
> Cheers
> 
> Carsten
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

I don't know socat or what benefit it gives you, but have you tried
using mbuffer to send and receive directly (options -I and -O)?
Additionally, try to set the block size of mbuffer to the recordsize of
zfs (usually 128k):
receiver$ mbuffer -I sender:1 -s 128k -m 2048M | zfs receive
sender$ zfs send blabla | mbuffer -s 128k -m 2048M -O receiver:1

As transmitting from /dev/zero to /dev/null runs at a rate of 350 MB/s, I
guess you are really hitting the maximum speed of your zpool. From my
understanding, sending is always slower than receiving, because reads are
random and writes are sequential. So it should be quite normal that
mbuffer's buffer doesn't see a lot of usage.

Cheers,
Thomas
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss