On 27.3.2010 23:21, Pasi Kärkkäinen wrote:
> On Thu, Mar 25, 2010 at 06:54:38PM +0100, [email protected] wrote:
>>
>> Hello Jiri,
>>
>> The high load may be caused by I/O wait (check with sar -u). In any case,
>> 30 MB/s seems a little slow for an FC array of any kind.
>>
Hi,

> 
> 30 MB/sec can be a LOT.. depending on the IO pattern and IO size.
> 
> If you're doing totally random IO where each IO is 512 bytes in size,
> then 30 MB/sec would equal over 61000 IOPS.
> 
> A single 15k SAS/FC disk can do around 300-400 random IOPS max, so 61000 IOPS
> would require you to have around 150 such (15k) disks in RAID-0.
> 
> -- Pasi

Actually that was about 15 MB/sec (the iostat units were 512-byte blocks).
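
(For reference, iostat's Blk_ columns are 512-byte sectors, so the peak write
rate seen on sde in the iostat output quoted below works out to

  28944 blocks/s * 512 B/block = 14,819,328 B/s  ~= 14.1 MiB/s

i.e. roughly 15 MB/s rather than 30.)
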
But it's true that our access pattern is quite random. There are 30+ users
logged in over ssh, others access their mail over IMAP/POP3, and some 80 PCs
mount the users' home directories over NFS. Sometimes 30 users run NetBeans at
once, plus Firefox ...

The real issue here is not the overall speed; I'm sorry if I didn't make myself
clear. The problem is that when there is a large amount of writes to a single
LUN, only a small percentage of requests (if any) make it to the other LUNs.

I ran another test to compare what happens when I bypass the page cache
(oflag=direct makes dd open the output file with O_DIRECT).
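
(If anyone wants to reproduce this, dropping the page cache between the runs
should make the read-side numbers comparable; on RHEL5 that should work as root
with something like

  # sync; echo 3 > /proc/sys/vm/drop_caches

though I have not included such runs here.)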

I generate some reads from all LUNs and all looks well (iostat is now showing
kB rather than blocks):

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             327.00      1440.00      5940.00       1440       5940
sdb              19.00      1588.00        28.00       1588         28
sdc              15.00      1720.00         0.00       1720          0
sdd              21.00      1700.00        32.00       1700         32
sde              28.00      1660.00        60.00       1660         60
sdf              13.00      1664.00         0.00       1664          0
sdg              71.00      1664.00       228.00       1664        228

Then I run
$ dd if=/dev/zero of=file bs=$((2**20)) count=128
It finishes in half a second, and after a while iostat says:

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               1.00       140.00         0.00        140          0
sdb               1.00       192.00         0.00        192          0
sdc               1.00       180.00         0.00        180          0
sdd               1.00       128.00         0.00        128          0
sde               1.00       128.00         0.00        128          0
sdf              46.00       144.00     23400.00        144      23400
sdg               2.00       128.00         4.00        128          4
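
The plain dd apparently just dirties pages in the cache that get flushed out
afterwards. For anyone who wants to dig further, an easy way to watch that
happening alongside iostat should be something like:

  $ watch -n1 'grep -E "^(Dirty|Writeback):" /proc/meminfo'

(I have not captured those counters here.)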

On the other hand, if I run
$ dd oflag=direct if=/dev/zero of=file bs=$((2**20)) count=128
it takes about 8 seconds to finish, and iostat says something like:

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             258.42      1702.97      3251.49       1720       3284
sdb              30.69      1710.89       102.97       1728        104
sdc              45.54      1699.01       704.95       1716        712
sdd              23.76      1817.82        15.84       1836         16
sde              18.81      1766.34        27.72       1784         28
sdf              85.15      1770.30     16308.91       1788      16472
sdg              62.38      1778.22       198.02       1796        200

Is it possible that flushing the FS cache (dirty page writeback) has a higher
priority than other accesses?
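
If it really is the dirty-page flushing that starves everything else, maybe
lowering the writeback thresholds would smooth the bursts out. I have not tried
this yet and the values below are only a guess, but on RHEL5 the knobs would be
vm.dirty_ratio and vm.dirty_background_ratio:

  $ sysctl vm.dirty_ratio vm.dirty_background_ratio
  (and then, as root, something like)
  # sysctl -w vm.dirty_background_ratio=1
  # sysctl -w vm.dirty_ratio=5

The deadline scheduler also has per-LUN tunables that bias reads against writes
(/sys/block/sdX/queue/iosched/writes_starved and write_expire), but those only
help once the requests actually reach the block layer.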

> 
>> I don't know your DS-4300 at all but if you're using a SAN or an FC loop  
>> to connect to your array, here are (maybe) a few things you might want to 
>> look for:
>>
>> - What kind of disks are used in your DS4300? 10k or 15k rpm FC disks? Did
>>   you check how heavily used were your disks during transfers? (there
>>   should be software provided with the array to allow that, perhaps even
>>   an embedded webserver).
7200 rpm SATA.

>>
>> - Did you monitor your array's Fibre Adapter activity? (unless you're the
>>   sole user of the array and no other server can hit the same physical
>>   disks, in which case you're most likely not overloading it).
I did not, but this server is the only one accessing the array.

>>
>> - Do you have multiple paths from your server to your switch and/or to
>>   your array? (even if the array is only active/passive and 2gbps; having
>>   multiple paths provides redundancy and better performance with correct
>>   configuration).
No.
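
(Bandwidth-wise a single 2 Gb/s FC link should still manage roughly

  2 Gbit/s / 10 bits per encoded byte (8b/10b) ~= 200 MB/s

which is far above anything we are seeing, so I would not expect the single
path itself to be the throughput bottleneck; it just means no redundancy.)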

>>
>> - What kind of data is your FS holding (many little files, hundreds of
>>   thousands of files, etc.?). Tuning the FS or switching to a different FS
>>   type can help.
>>
>> - If there is no bottleneck noticed above, then striping might help
>>   (that's what we use here on active/active DMX arrays), but take care not
>>   to end up on the same physical disks at the array block level.
>>
Yes, it might help, but there's no easy way to do the switch.
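
Just for completeness, the striped variant would presumably look something like
the following, with -i for the number of stripes across the seven LUNs and -I
for the stripe size in KiB (hypothetical LV name and stripe size, and it would
need enough free extents plus a full copy of the data, which is exactly the
painful part):

  # lvcreate -i 7 -I 256 -L 2232G -n newhome_striped array-vg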

>> My 2c,
>>
>> Vincent

Thanks for your suggestions,

Jiri Novosad

>>
>> On Wed, 24 Mar 2010, Jiri Novosad wrote:
>>
>>> Hello,
>>>
>>> we have a problem with our disk array. It might even be in HW, I'm not sure.
>>> The array holds home directories of our users + mail.
>>>
>>> HW configuration:
>>>
>>> a HP DL585 server, with four 6-core Opterons, 128GiB RAM
>>>
>>> array: IBM DS4300 with 7 LUNs, each a RAID5 with 4 disks (250GB).
>>> Fibre Channel: QLogic Corp. ISP2432-based 4Gb Fibre Channel to PCI Express 
>>> HBA
>>>  (the array only supports 2Gb)
>>> NCQ queue depth is 32
>>>
>>> SW configuration:
>>>
>>> RHEL5.3
>>>
>>> the home partition is a linear LVM volume:
>>>
>>> # lvdisplay -m /dev/array-vg/newhome
>>>  --- Logical volume ---
>>>  LV Name                /dev/array-vg/newhome
>>>  VG Name                array-vg
>>>  LV UUID                9XxWH5-5yv4-t661-K24d-Hdzg-G0aW-zUxRul
>>>  LV Write Access        read/write
>>>  LV Status              available
>>>  # open                 1
>>>  LV Size                2.18 TB
>>>  Current LE             571393
>>>  Segments               9
>>>  Allocation             inherit
>>>  Read ahead sectors     auto
>>>  - currently set to     256
>>>  Block device           253:7
>>>
>>>  --- Segments ---
>>>  Logical extent 0 to 66998:
>>>    Type                linear
>>>    Physical volume     /dev/sda
>>>    Physical extents    111470 to 178468
>>>
>>>  Logical extent 66999 to 133997:
>>>    Type                linear
>>>    Physical volume     /dev/sdb
>>>    Physical extents    111470 to 178468
>>>
>>>  Logical extent 133998 to 200996:
>>>    Type                linear
>>>    Physical volume     /dev/sdc
>>>    Physical extents    111470 to 178468
>>>
>>>  Logical extent 200997 to 267995:
>>>    Type                linear
>>>    Physical volume     /dev/sdd
>>>    Physical extents    111470 to 178468
>>>
>>>  Logical extent 267996 to 334994:
>>>    Type                linear
>>>    Physical volume     /dev/sde
>>>    Physical extents    111470 to 178468
>>>
>>>  Logical extent 334995 to 401993:
>>>    Type                linear
>>>    Physical volume     /dev/sdf
>>>    Physical extents    111470 to 178468
>>>
>>>  Logical extent 401994 to 468992:
>>>    Type                linear
>>>    Physical volume     /dev/sdg
>>>    Physical extents    111470 to 178468
>>>
>>>  Logical extent 468993 to 527946:
>>>    Type                linear
>>>    Physical volume     /dev/sdg
>>>    Physical extents    15945 to 74898
>>>
>>>  Logical extent 527947 to 571392:
>>>    Type                linear
>>>    Physical volume     /dev/sdc
>>>    Physical extents    15945 to 59390
>>>
>>> All LUNs use the deadline scheduler.
>>>
>>> Now the problem:
>>> whenever there is a 'large' write (in the order of hundreds of megabytes),
>>> the system load rises considerably.
>>> Inspection using iostat shows that from something like this:
>>>
>>> Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
>>> sda             373.00         8.00      7792.00          8       7792
>>> sdb              11.00         8.00        80.00          8         80
>>> sdc              13.00         8.00        96.00          8         96
>>> sdd               9.00         8.00        80.00          8         80
>>> sde              23.00         8.00       296.00          8        296
>>> sdf               9.00         8.00        80.00          8         80
>>> sdg               5.00         8.00        32.00          8         32
>>>
>>> after a $ dd if=/dev/zero of=file bs=$((2**20)) count=128
>>> it goes to this:
>>>
>>> Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
>>> sda               0.00         0.00         0.00          0          0
>>> sdb               0.00         0.00         0.00          0          0
>>> sdc               0.00         0.00         0.00          0          0
>>> sdd               0.00         0.00         0.00          0          0
>>> sde              31.00         8.00     28944.00          8      28944
>>> sdf               1.00         8.00         0.00          8          0
>>> sdg               1.00         8.00         0.00          8          0
>>>
>>> and when I generate some reads it goes from
>>>
>>> Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
>>> sda             171.00      3200.00      3448.00       3200       3448
>>> sdb              24.00      3336.00        56.00       3336         56
>>> sdc              17.00      3280.00        16.00       3280         16
>>> sdd              15.00      3208.00        24.00       3208         24
>>> sde              18.00      3200.00        56.00       3200         56
>>> sdf              18.00      3192.00        40.00       3192         40
>>> sdg              23.00      3184.00       144.00       3184        144
>>>
>>> to
>>>
>>> Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
>>> sda               5.00       392.00        88.00        392         88
>>> sdb               2.00       352.00         0.00        352          0
>>> sdc               2.00       264.00         0.00        264          0
>>> sdd               2.00       264.00         0.00        264          0
>>> sde             277.00       560.00     38744.00        560      38744
>>> sdf               2.00       264.00         0.00        264          0
>>> sdg               1.00       296.00         0.00        296          0
>>>
>>> It looks like the single write somehow cancels out all other requests.
>>>
>>> Switching to a striped LVM volume would probably help, but the data 
>>> migration would
>>> be really painful for us.
>>>
>>> Has anyone an idea where the problem might be? Any pointers would be 
>>> appreciated.
>>>
>>> Regards,
>>> Jiri Novosad
>>

_______________________________________________
rhelv5-list mailing list
[email protected]
https://www.redhat.com/mailman/listinfo/rhelv5-list
