I owe you all an update...

We found a clear pattern that we can now recreate at will.  Whenever we 
read/write the pool, it delivers the expected throughput and IOPS for a 
while, but at some point it slows to a crawl, nothing responds and 
everything pretty much hangs for a few seconds, and then things go back to 
normal for a little while.  Sometimes the problem is barely noticeable and 
only happens once every few minutes; at other times it is every few 
seconds.  We can be doing the exact same operation and sometimes it is fast 
and sometimes it is slow.  The more clients are connected, the worse the 
problem typically gets.  And no, it is not happening every 30 seconds when 
transaction groups are committed to disk.
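
For anyone who wants to see the pattern for themselves, something along 
these lines is enough to make the stalls visible from the storage box (the 
pool name below is just a placeholder for ours):

  # per-second pool throughput/IOPS; during a stall the read/write
  # columns drop to zero for several consecutive intervals
  zpool iostat -v tank 1

  # per-disk service times and %busy at the same time; a stall with low
  # %b on every disk points away from the spindles themselves
  iostat -xnz 1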

Now... every time that slowdown occurs, the load on the Nexenta box gets 
crazy high: it can reach 35 or more and the console doesn't even respond 
anymore.  The rest of the time the load barely reaches 3.  The box has four 
Intel Xeon 7500-series CPUs and 256 GB of RAM, and uses 15K SAS HDDs in 
mirrored stripes on LSI 9200-8e HBAs, so we're certainly not underpowered.  
We also see the same issue on a box with two of the latest AMD Opteron 
CPUs (Magny-Cours) and 128 GB of RAM.

We are able to reach 800 MB/s and more over the network when things go 
well, but the average gets destroyed by the slowdowns, during which there 
is zero throughput.

These tests are run without any L2ARC or SLOG, but past tests have shown 
the same issue when using them.  We've tried 12x 100 GB Samsung SLC SSDs 
and DDRDrive X1s, among other things, and while they make the whole thing 
much faster, they don't prevent those intermittent slowdowns from 
happening.
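
For reference, the L2ARC and SLOG setups mentioned above were configured 
the usual way, roughly like this (the device names are placeholders, not 
our actual layout):

  # SSDs added as L2ARC (read cache) devices
  zpool add tank cache c0t1d0 c0t2d0
  # mirrored pair added as the separate intent log (SLOG)
  zpool add tank log mirror c0t3d0 c0t4d0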

Our next step is to isolate the process that takes all that CPU...
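
We'll probably start with the usual suspects, roughly along these lines, 
to see whether the time is going to a user-land process or to the kernel:

  # per-thread microstate accounting, refreshed every second; shows which
  # processes/threads are burning CPU or stuck waiting during a stall
  prstat -mLc 1

  # sample kernel stacks for ~30s in case the CPU time turns out to be
  # kernel time rather than a user-land process
  dtrace -n 'profile-997 /arg0/ { @[stack()] = count(); } tick-30s { exit(0); }'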

Ian