I owe you all an update... We found a clear pattern that we can now recreate at will. Whenever we read from or write to the pool, it gives the expected throughput and IOPS for a while, but at some point it slows to a crawl: nothing responds, everything pretty much hangs for a few seconds, and then things go back to normal for a little while. Sometimes the problem is barely noticeable and happens only once every few minutes; at other times it happens every few seconds. We can run the exact same operation and sometimes it is fast and sometimes it is slow. The more clients are connected, the worse the problem typically gets. And no, it is not happening every 30 seconds when things are committed to disk.
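(For anyone who wants to check the txg-commit angle on their own pool: something like the DTrace one-liner below should show when each transaction group sync runs and how long it takes. This is just a sketch, on the assumption that fbt can probe spa_sync, the kernel function that writes out each txg. If your stalls line up with these events, the commits are your problem; ours do not.)

    # Print the duration of each txg sync as it completes.
    dtrace -n '
    fbt::spa_sync:entry { self->ts = timestamp; }
    fbt::spa_sync:return /self->ts/ {
        printf("txg sync: %d ms", (timestamp - self->ts) / 1000000);
        self->ts = 0;
    }'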
Now... every time that slowdown occurs, the load on the Nexenta box gets crazy high: it can reach 35 or more, and the console stops responding entirely. The rest of the time the load barely reaches 3. The box has four Intel Xeon 7500-series CPUs, 256GB of RAM, and 15K SAS HDDs in mirrored stripes on LSI 9200-8e HBAs, so we are certainly not underpowered. We see the same issue on a box with two of the latest AMD Opteron CPUs (the Magny-Cours) and 128GB of RAM. We can reach 800MB/sec and more over the network when things go well, but the average gets destroyed by the slowdowns, during which throughput drops to zero. These tests were run without any L2ARC or SLOG, but past tests have shown the same issue when using them. We have tried 12x 100GB Samsung SLC SSDs and DDRDrive X1s, among other things, and while they make the whole thing much faster, they do not prevent those intermittent slowdowns from happening. Our next step is to isolate whatever is taking all that CPU during the stalls; the rough plan is sketched below.
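(Standard OpenSolaris tooling, nothing exotic; treat this as a sketch rather than a recipe:)

    # First, is the load in kernel or user time? Watch mpstat's usr/sys
    # columns across a stall.
    mpstat 1

    # Per-thread microstates; during a stall, look for threads pegged
    # in SYS or piling up in LCK/LAT.
    prstat -mLc 1

    # If it is kernel time, profile kernel stacks at 997 Hz for 30
    # seconds and keep the 20 hottest; the aggregation prints on exit.
    dtrace -n 'profile-997 /arg0/ { @[stack()] = count(); }
               tick-30sec { trunc(@, 20); exit(0); }'

Ian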