Re: [zfs-discuss] [storage-discuss] AVS on opensolaris 2008.11

2009-01-26 Thread Jim Dunham
Richard Elling wrote:
> Jim Dunham wrote:
>> Ahmed,
>>
>>> The setup is not there anymore, however, I will share as much  
>>> details
>>> as I have documented. Could you please post the commands you have  
>>> used
>>> and any differences you think might be important. Did you ever test
>>> with 2008.11 ? instead of sxce ?
>>>
>>
>> Specific to the following:
>>
>>>>> While we should be getting minimal performance hit (hopefully), we got
>>>>> a big performance hit, disk throughput was reduced to almost 10% of
>>>>> the normal rate.
>
>>
>> It looks like I need to test on OpenSolaris 2008.11, not Solaris
>> Express CE (b105), since this version does not have access to a
>> version of 'dd' with an oflag= setting.
>>
>> # dd if=/dev/zero of=/dev/zvol/rdsk/gold/xxVolNamexx oflag=dsync
>> bs=256M count=10
>> dd: bad argument: "oflag=dsync"
>>
>
> Congratulations!  You've been bit by the gnu-compatibility feature!

Oh that's what one calls it... a feature?

> SXCE and OpenSolaris have more than one version of dd.  The difference
> is that OpenSolaris sets your default PATH to use /usr/gnu/bin/dd, which
> has the oflag option, while SXCE sets your default PATH to use
> /usr/bin/dd.

Thank you,

Jim

>
> -- richard
>
>> Using a setting of 'oflag=dsync' will have performance implications.
>>
>> Also there is an issue with an I/O of size bs=256M. SNDR's
>> internal architecture has an I/O unit chunk size of one bit per
>> 32KB. Therefore when doing an I/O of 256MB, this results in the
>> need to set 8192 bits, 1024 bytes, or 1KB of data to 0xFF.
>> Although testing with an I/O size of 256MB is interesting, typical
>> I/O tests are more like the following:
>> http://www.opensolaris.org/os/community/performance/filebench/quick_start/
>>
>> - Jim
>>


Re: [zfs-discuss] [storage-discuss] AVS on opensolaris 2008.11

2009-01-26 Thread Richard Elling
Jim Dunham wrote:
> Ahmed,
>
>   
>> The setup is not there anymore, however, I will share as much details
>> as I have documented. Could you please post the commands you have used
>> and any differences you think might be important. Did you ever test
>> with 2008.11 ? instead of sxce ?
>> 
>
> Specific to the following:
>
>   
>>>> While we should be getting minimal performance hit (hopefully), we got
>>>> a big performance hit, disk throughput was reduced to almost 10% of
>>>> the normal rate.
>
> It looks like I need to test on OpenSolaris 2008.11, not Solaris
> Express CE (b105), since this version does not have access to a
> version of 'dd' with an oflag= setting.
>
> # dd if=/dev/zero of=/dev/zvol/rdsk/gold/xxVolNamexx oflag=dsync
> bs=256M count=10
> dd: bad argument: "oflag=dsync"
>   

Congratulations!  You've been bit by the gnu-compatibility feature!
SXCE and OpenSolaris have more than one version of dd.  The difference
is that OpenSolaris sets your default PATH to use /usr/gnu/bin/dd, which
has the oflag option, while SXCE sets your default PATH to use /usr/bin/dd.
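For anyone who hits the same error, a minimal workaround sketch (the paths
are the ones described above; the zvol name is just the placeholder from the
quoted command, and the bs/count values are arbitrary) is to call the GNU dd
explicitly:

   which dd
   /usr/gnu/bin/dd if=/dev/zero of=/dev/zvol/rdsk/gold/xxVolNamexx \
       oflag=dsync bs=8k count=1000

On OpenSolaris 2008.11 the first command should point at /usr/gnu/bin/dd; on
SXCE it points at /usr/bin/dd, which is why the full path is needed there.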
 -- richard

> Using a setting of 'oflag=dsync' will have performance implications.
>
> Also there is an issue with an I/O of size bs=256M. SNDR's internal
> architecture has an I/O unit chunk size of one bit per 32KB. Therefore
> when doing an I/O of 256MB, this results in the need to set 8192 bits,
> 1024 bytes, or 1KB of data to 0xFF.  Although testing with an I/O
> size of 256MB is interesting, typical I/O tests are more like the
> following: 
> http://www.opensolaris.org/os/community/performance/filebench/quick_start/
>
> - Jim
>


Re: [zfs-discuss] [storage-discuss] AVS on opensolaris 2008.11

2009-01-26 Thread Jim Dunham
Ahmed,

> The setup is not there anymore, however, I will share as much details
> as I have documented. Could you please post the commands you have used
> and any differences you think might be important. Did you ever test
> with 2008.11 ? instead of sxce ?

Specific to the following:

>>> While we should be getting minimal performance hit (hopefully), we  
>>> got
>>> a big performance hit, disk throughput was reduced to almost 10% of
>>> the normal rate.

It looks like I need to test on OpenSolaris 2008.11, not Solaris
Express CE (b105), since this version does not have access to a
version of 'dd' with an oflag= setting.

# dd if=/dev/zero of=/dev/zvol/rdsk/gold/xxVolNamexx oflag=dsync  
bs=256M count=10
dd: bad argument: "oflag=dsync"

Using a setting of 'oflag=dsync' will have performance implications.

Also there is an issue with an I/O of size bs=256M. SNDR's internal
architecture has an I/O unit chunk size of one bit per 32KB. Therefore
when doing an I/O of 256MB, this results in the need to set 8192 bits,
1024 bytes, or 1KB of data to 0xFF.  Although testing with an I/O
size of 256MB is interesting, typical I/O tests are more like the
following:
http://www.opensolaris.org/os/community/performance/filebench/quick_start/
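For the record, the bitmap arithmetic for a single 256MB write checks out
with simple shell arithmetic (nothing SNDR-specific beyond the
one-bit-per-32KB figure above):

   $ echo $((256 * 1024 / 32))    # 256MB split into 32KB chunks = 8192 bits
   8192
   $ echo $((8192 / 8))           # 8192 bits = 1024 bytes = 1KB of bitmap
   1024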

- Jim



Re: [zfs-discuss] [storage-discuss] AVS on opensolaris 2008.11

2009-01-26 Thread Ahmed Kamal
Hi Jim,

The setup is not there anymore, however, I will share as much details
as I have documented. Could you please post the commands you have used
and any differences you think might be important. Did you ever test
with 2008.11 ? instead of sxce ?

I will probably be testing again soon. Any tips or obvious errors are welcome :)

->8-
The Setup
* A 100G zvol has been set up on each node of an AVS replicating pair
* A "ramdisk" has been set up on each node using
  ramdiskadm -a ram1 10m
* The replication relationship has been set up using the command below
(the fields are annotated just after this list)
  sndradm -E pri /dev/zvol/rdsk/gold/myzvol /dev/rramdisk/ram1 sec
/dev/zvol/rdsk/gold/myzvol /dev/rramdisk/ram1 ip async
* The AVS driver was configured to not log the disk bitmap to disk,
rather to keep it in kernel memory and write it to disk only upon
machine shutdown. This is configured as such
  grep bitmap_mode /usr/kernel/drv/rdc.conf
  rdc_bitmap_mode=2;
* The replication was configured to be in logging mode
  sndradm -P
  /dev/zvol/rdsk/gold/myzvol  <-  pri:/dev/zvol/rdsk/gold/myzvol
  autosync: off, max q writes: 4096, max q fbas: 16384, async threads:
2, mode: async, state: logging
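For readers who have not used SNDR before, here is my reading of the fields
in that enable command (an annotation only; the hosts and paths are exactly
the ones used above, and -E means enable without an initial full
synchronization, as discussed later in this thread):

  sndradm -E <phost> <pdev> <pbitmap> <shost> <sdev> <sbitmap> ip {sync|async}

  pri                          primary host
  /dev/zvol/rdsk/gold/myzvol   primary (replicated) data volume
  /dev/rramdisk/ram1           primary bitmap volume (the ramdisk)
  sec                          secondary host
  /dev/zvol/rdsk/gold/myzvol   secondary data volume
  /dev/rramdisk/ram1           secondary bitmap volume
  ip async                     network transport and replication mode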

Testing was done with:

 dd if=/dev/zero of=/dev/zvol/rdsk/gold/xxVolNamexx oflag=dsync bs=256M count=10

* Option 'dsync' is chosen to try to avoid zfs's aggressive caching.
In addition, a couple of runs were usually launched initially to
fill the zfs cache and force real writing to disk
* Option 'bs=256M' was used in order to avoid the overhead of copying
multiple small blocks to kernel memory before disk writes. A larger bs
size ensures max throughput. Smaller values were used without much
difference, though
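For example, a smaller-block run with the same total payload would look like
this (hypothetical numbers: 8M x 320 is the same 2560MB as 256M x 10):

 dd if=/dev/zero of=/dev/zvol/rdsk/gold/xxVolNamexx oflag=dsync bs=8M count=320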

The results on multiple runs

Non Replicated Vol Throughputs: 42.2, 52.8, 50.9 MB/s
Replicated Vol Throughputs:  4.9, 5.5, 4.6 MB/s

-->8-

Regards

On Mon, Jan 26, 2009 at 1:22 AM, Jim Dunham  wrote:
> Ahmed,
>
>> Thanks for your informative reply. I am involved with kristof
>> (original poster) in the setup, please allow me to reply below
>>
>>> Was the following 'test' run during resynchronization mode or replication
>>> mode?
>>>
>>
>> Neither, testing was done while in logging mode. This was chosen to
>> simply avoid any network "issues" and to get the setup working as fast
>> as possible. The setup was created with:
>>
>> sndradm -E pri /dev/zvol/rdsk/gold/myzvol /dev/rramdisk/ram1 sec
>> /dev/zvol/rdsk/gold/myzvol /dev/rramdisk/ram1 ip async
>>
>> Note that the logging disks are ramdisks again trying to avoid disk
>> contention and get fastest performance (reliability is not a concern
>> in this test). Before running the tests, this was the state
>>
>> #sndradm -P
>> /dev/zvol/rdsk/gold/myzvol  <-  pri:/dev/zvol/rdsk/gold/myzvol
>> autosync: off, max q writes: 4096, max q fbas: 16384, async threads:
>> 2, mode: async, state: logging
>>
>> While we should be getting minimal performance hit (hopefully), we got
>> a big performance hit, disk throughput was reduced to almost 10% of
>> the normal rate.
>
> Is it possible to share information on your ZFS storage pool configuration,
> your testing tool, testing types and resulting data?
>
> I just downloaded Solaris Express CE (b105)
> http://opensolaris.org/os/downloads/sol_ex_dvd_1/,  configured ZFS in
> various storage pool types, SNDR with and without RAM disks, and I do not
> see that disk throughput was reduced to almost 10% of the normal rate. Yes,
> there is some performance impact, but nowhere near the amount reported.
>
> There are various factors which could come into play here, but the most
> obvious reason that someone may see a serious performance degradation as
> reported, is that prior to SNDR being configured, the existing system under
> test was already maxed out on some system limitation, such as CPU and
> memory.  I/O impact should not be a factor, given that a RAM disk is used.
> On such an already-saturated system, the addition of both SNDR and a RAM
> disk in the data path, regardless of how small their system cost is, will
> have a profound impact on disk throughput.
>
> Jim
>
>>
>> Please feel free to ask for any details, thanks for the help
>>
>> Regards


Re: [zfs-discuss] [storage-discuss] AVS on opensolaris 2008.11

2009-01-25 Thread Jim Dunham
Ahmed,

> Thanks for your informative reply. I am involved with kristof
> (original poster) in the setup, please allow me to reply below
>
>> Was the following 'test' run during resynchronization mode or
>> replication mode?
>>
>
> Neither, testing was done while in logging mode. This was chosen to
> simply avoid any network "issues" and to get the setup working as fast
> as possible. The setup was created with:
>
> sndradm -E pri /dev/zvol/rdsk/gold/myzvol /dev/rramdisk/ram1 sec
> /dev/zvol/rdsk/gold/myzvol /dev/rramdisk/ram1 ip async
>
> Note that the logging disks are ramdisks again trying to avoid disk
> contention and get fastest performance (reliability is not a concern
> in this test). Before running the tests, this was the state
>
> #sndradm -P
> /dev/zvol/rdsk/gold/myzvol  <-  pri:/dev/zvol/rdsk/gold/myzvol
> autosync: off, max q writes: 4096, max q fbas: 16384, async threads:
> 2, mode: async, state: logging
>
> While we should be getting minimal performance hit (hopefully), we got
> a big performance hit, disk throughput was reduced to almost 10% of
> the normal rate.

Is it possible to share information on your ZFS storage pool  
configuration, your testing tool, testing types and resulting data?

I just downloaded Solaris Express CE (b105)
(http://opensolaris.org/os/downloads/sol_ex_dvd_1/), configured ZFS in
various storage pool types, SNDR with and without RAM disks, and I do
not see that disk throughput was reduced to almost 10% of the normal
rate. Yes, there is some performance impact, but nowhere near the
amount reported.

There are various factors which could come into play here, but the  
most obvious reason that someone may see a serious performance  
degradation as reported, is that prior to SNDR being configured, the  
existing system under test was already maxed out on some system  
limitation, such as CPU and memory.  I/O impact should not be a  
factor, given that a RAM disk is used. On such an already-saturated
system, the addition of both SNDR and a RAM disk in the data path,
regardless of how small their system cost is, will have a profound
impact on disk throughput.
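If it helps, a generic way to check for that kind of saturation (nothing
SNDR-specific) is to watch CPU, memory and disk activity alongside the dd
run, e.g.:

   vmstat 5          # cpu idle (id) and memory scan rate (sr) columns
   iostat -xnz 5     # per-device %b, wait and actv columns

If the CPU is already pegged or the disks are already near 100% busy before
SNDR is enabled, any added cost will show up as a large throughput drop.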

Jim

>
> Please feel free to ask for any details, thanks for the help
>
> Regards


Re: [zfs-discuss] [storage-discuss] AVS on opensolaris 2008.11

2009-01-24 Thread Ahmed Kamal
Hi Jim,
Thanks for your informative reply. I am involved with kristof
(original poster) in the setup, please allow me to reply below

> Was the following 'test' run during resynchronization mode or replication
> mode?
>

Neither, testing was done while in logging mode. This was chosen to
simply avoid any network "issues" and to get the setup working as fast
as possible. The setup was created with:

sndradm -E pri /dev/zvol/rdsk/gold/myzvol /dev/rramdisk/ram1 sec
/dev/zvol/rdsk/gold/myzvol /dev/rramdisk/ram1 ip async

Note that the logging disks are ramdisks again trying to avoid disk
contention and get fastest performance (reliability is not a concern
in this test). Before running the tests, this was the state

#sndradm -P
/dev/zvol/rdsk/gold/myzvol  <-  pri:/dev/zvol/rdsk/gold/myzvol
autosync: off, max q writes: 4096, max q fbas: 16384, async threads:
2, mode: async, state: logging

While we should be getting minimal performance hit (hopefully), we got
a big performance hit, disk throughput was reduced to almost 10% of
the normal rate.
Please feel free to ask for any details, thanks for the help

Regards


Re: [zfs-discuss] [storage-discuss] AVS on opensolaris 2008.11

2009-01-24 Thread Jim Dunham
Kristof,

> Jim, yes, in step 5 commands were executed on both nodes.
>
> We did some more tests with opensolaris 2008.11. (build 101b)
>
> We managed to get AVS setup up and running, but we noticed that  
> performance was really bad.
>
> When we configured a zfs volume for replication, we noticed that  
> write performance went down from 50 MB/s to 5 MB/sec.

SNDR replication has three modes of operation, and I/O performance  
varies quite differently for each one. They are:

1). Logging mode - As primary volume write I/Os occur, the bitmap  
volume is used to scoreboard unreplicated write I/Os, at which time  
the write I/O completes.

2). Resynchronization mode - A resynchronization thread traverses the  
scoreboard, in block order, replicating write I/Os for each bit set.  
Concurrently, as primary volume write I/Os occur, the bitmap volume is  
used to scoreboard unreplicated write I/Os. For write I/Os that occur  
(block order wise) after the resynchronization point, the write I/O  
completes. For write I/Os that occur before the resynchronization
point, they must be synchronously replicated in place. At the start of  
resynchronization, almost all write I/Os complete quickly, as they  
occur after the resynchronization point. As resynchronization nears  
completion, almost all write I/Os complete slowly, as they occur  
before the resynchronization point. When the resynchronization point  
reaches the end of the scoreboard, the SNDR primary and secondary  
volumes are now 100% identical, write-order consistent, and  
asynchronous replication begins.

3). Replication mode - Primary volume write I/Os are queued up to
SNDR's memory queue (or optionally configured disk queue), and
scoreboarded for replication, at which time the write I/O completes.
In the background, multiple asynchronous flusher threads dequeue
unreplicated I/Os from SNDR's memory or disk queue and replicate them
to the secondary node.

On configurations with ample system resources, write performance for  
both logging mode and replication mode should be nearly identical.
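As a practical aside, the 'state:' field in the status output posted earlier
in this thread is what tells you which of these modes a given set is actually
in at test time:

   sndradm -P
   /dev/zvol/rdsk/gold/myzvol  <-  pri:/dev/zvol/rdsk/gold/myzvol
   autosync: off, max q writes: 4096, max q fbas: 16384, async threads:
   2, mode: async, state: logging

Here 'state: logging' confirms logging mode; the field changes while a
(re)synchronization is in progress and again once replication is active.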

The duration that a replica is in resynchronization mode is influenced
by the amount of write I/Os that occurred while the replica was in  
logging mode, the amount of primary volume write I/Os while  
resynchronization is also active, the network bandwidth and latency  
between primary and secondary nodes, and the I/O performance of the  
remote node's secondary volume.

First time synchronization, done after the SNDR enable "sndradm -e ...",
is identical to resynchronization, except the bitmap volume is
intentionally set to ALL ones, forcing every block to be replicated
from primary to secondary. Now if one configured replication before
the initial "zpool create", the SNDR primary and secondary volumes
both contain uninitialized data, and thus can be considered equal, so
no synchronization is needed.  This is accomplished by using
the "sndradm -E ..." option, setting the bitmap volume to ALL zeros.
This means that the switch from logging mode to replication mode is
nearly instant.
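A minimal sketch of that ordering, with hypothetical device names (the point
is only the sequence: the -E enable comes before the pool is created on the
primary volume):

   # Enable SNDR with -E: bitmap all zeros, no initial synchronization
   sndradm -E host1 /dev/rdsk/c1t1d0s0 /dev/rdsk/c1t2d0s0 host2 \
       /dev/rdsk/c1t1d0s0 /dev/rdsk/c1t2d0s0 ip async

   # Only now lay the pool down on the primary volume; the writes that
   # zpool create issues are replicated as ordinary write I/Os
   zpool create tank c1t1d0s0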

If one has a ZFS storage pool, plus available storage that can be
provisioned as zpool replacement volumes, these replacement volumes
can be enabled first with "sndradm -E ..". Now when the "zpool
replace ..." command is invoked, the write I/Os issued by ZFS to
populate the replacement volume will cause SNDR to replicate only
those write I/Os. This operation is done under SNDR's replication
mode, not synchronization mode, and is also a ZFS background
operation. Once the zpool replace is complete, the previously used
storage can be reclaimed.
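A sketch of that flow as well, again with hypothetical device names; the
replacement device is SNDR-enabled with -E before ZFS starts writing to it:

   # Enable SNDR on the replacement device first (no initial sync needed)
   sndradm -E host1 /dev/rdsk/c2t1d0s0 /dev/rdsk/c2t2d0s0 host2 \
       /dev/rdsk/c2t1d0s0 /dev/rdsk/c2t2d0s0 ip async

   # ZFS resilvers onto the new device; SNDR replicates only those writes
   zpool replace tank c1t1d0s0 c2t1d0s0

   # Watch for the replace/resilver to finish, then reclaim the old storage
   zpool status tank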


> A few notes about our test setup:
>
> *  Since replication is configured in logging mode, there is zero  
> network traffic
> *  Since rdc_bitmap_mode has been configured for memory, and even
> more, since the bitmap device is a ramdisk, any data IO on the
> replicated volume results only in a single memory bit flip (per 32k
> of disk space)
> * This setup is the bare minimum in the sense that the kernel driver
> only hooks disk writes and flips a bit in memory; it cannot go any
> faster!

Was the following 'test' run during resynchronization mode or replication
mode?

> The Test
>
> * All tests were performed using the following command line
> # dd if=/dev/zero of=/dev/zvol/rdsk/gold/xxVolNamexx oflag=dsync  
> bs=256M count=10
>
> * Option 'dsync' is chosen to try to avoid zfs's aggressive caching.
> In addition, a couple of runs were usually launched initially
> to fill the zfs cache and force real writing to disk
> * Option 'bs=256M' was used in order to avoid the overhead of
> copying multiple small blocks to kernel memory before disk writes. A
> larger bs size ensures max throughput. Smaller values were used
> without much difference, though
> -- 
> This message posted from opensolaris.org