On 25/09/2009, at 11:49 AM, Bob Friesenhahn wrote:

> The commentary says that normally the COMMIT operations occur during
> close(2) or fsync(2) system call, or when encountering memory
> pressure. If the problem is slow copying of many small files, this
> COMMIT approach does not help very much since very little data is
> sent per file and most time is spent creating directories and files.
The problem appears to be slog bandwidth exhaustion: all data is being
sent via the slog, which creates contention for all subsequent NFS or
locally synchronous writes. The NFS writes do not appear to be
synchronous in nature: only a single COMMIT is issued at the very end.
However, all of the data appears to go via the slog, and it appears to
inflate to roughly twice its original size.
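For reference, the close(2)/fsync(2)-driven commit points described in the
commentary look like this from an application's point of view (a minimal
local sketch; on an NFS client, fsync and close are where the COMMIT RPC
gets issued; the file name here is illustrative):

```python
import os
import tempfile

# Write a file and force it to stable storage at the same points where
# an NFS client would issue a COMMIT: fsync(2) and close(2).
path = os.path.join(tempfile.mkdtemp(), "example.dat")
fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
os.write(fd, b"payload")  # data may still sit in the client-side cache
os.fsync(fd)              # forces data to stable storage (NFS: COMMIT)
os.close(fd)              # close(2) is the other commit point

with open(path, "rb") as f:
    print(f.read())       # b'payload'
```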
For a test, I copied a relatively small file (8.4MB). Looking at a
tcpdump capture in Wireshark, there is a SETATTR which ends with a V3
COMMIT, and no COMMIT messages during the transfer itself. The matching
iostat output looks like this, showing the slog write of the data
(about 17MB appears to hit the slog):
Friday, 25 September 2009 1:01:00 PM EST
                 extended device statistics       ---- errors ---
    r/s    w/s   kr/s     kw/s wait actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
    0.0  135.0    0.0  17154.5  0.0  0.8    0.0    6.0   0   3   0   0   0   0 c7t2d0
Then, a few seconds later, the transaction group is flushed to primary
storage, writing nearly 11.4MB, which is in line with raidz2
expectations (around 10.5MB; 8.4/8*10):
Friday, 25 September 2009 1:01:13 PM EST
                 extended device statistics       ---- errors ---
    r/s    w/s   kr/s     kw/s wait actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
    0.0   91.0    0.0   1170.4  0.0  0.1    0.0    1.3   0   2   0   0   0   0 c11t0d0
    0.0   84.0    0.0   1171.4  0.0  0.1    0.0    1.2   0   2   0   0   0   0 c11t1d0
    0.0   92.0    0.0   1172.4  0.0  0.1    0.0    1.2   0   2   0   0   0   0 c11t2d0
    0.0   84.0    0.0   1172.4  0.0  0.1    0.0    1.3   0   2   0   0   0   0 c11t3d0
    0.0   81.0    0.0   1176.4  0.0  0.1    0.0    1.4   0   2   0   0   0   0 c11t4d0
    0.0   86.0    0.0   1176.4  0.0  0.1    0.0    1.4   0   2   0   0   0   0 c11t5d0
    0.0   89.0    0.0   1175.4  0.0  0.1    0.0    1.4   0   2   0   0   0   0 c11t6d0
    0.0   84.0    0.0   1175.4  0.0  0.1    0.0    1.3   0   2   0   0   0   0 c11t7d0
    0.0   91.0    0.0   1168.9  0.0  0.1    0.0    1.3   0   2   0   0   0   0 c11t8d0
    0.0   89.0    0.0   1170.9  0.0  0.1    0.0    1.4   0   2   0   0   0   0 c11t9d0
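As a sanity check on the raidz2 expectation above, the on-disk footprint
of a file on an N-disk raidz group can be estimated as data / (data
disks) * (total disks). This is a rough back-of-the-envelope sketch that
ignores padding, metadata, and allocation granularity:

```python
def raidz_footprint_mb(data_mb, data_disks, parity_disks):
    # Data is striped across the data disks; parity adds parity_disks
    # extra columns, so each MB of data occupies (d + p) / d MB on disk.
    total_disks = data_disks + parity_disks
    return data_mb / data_disks * total_disks

# 8.4MB file on a 10-disk raidz2 (8 data + 2 parity): ~10.5MB on disk
print(round(raidz_footprint_mb(8.4, 8, 2), 1))  # 10.5
```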
So I performed the same test with a much larger file (533MB), bigger
than the NVRAM cache in front of the SSD, to see what it would do. Note
that after the second second of activity the NVRAM is full and admits
only about the sequential write speed of the SSD (~70MB/s).
Friday, 25 September 2009 1:13:14 PM EST
                 extended device statistics       ---- errors ---
    r/s    w/s   kr/s     kw/s wait actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
    0.0  640.9    0.0  81782.9  0.0  4.2    0.0    6.5   1  14   0   0   0   0 c7t2d0
    0.0 1065.7    0.0 136408.1  0.0 18.6    0.0   17.5   1  78   0   0   0   0 c7t2d0
    0.0  579.0    0.0  74113.3  0.0 30.7    0.0   53.1   1 100   0   0   0   0 c7t2d0
    0.0  588.7    0.0  75357.0  0.0 33.2    0.0   56.3   1 100   0   0   0   0 c7t2d0
    0.0  532.0    0.0  68096.3  0.0 31.5    0.0   59.1   1 100   0   0   0   0 c7t2d0
    0.0  559.0    0.0  71428.0  0.0 32.5    0.0   58.1   1 100   0   0   0   0 c7t2d0
    0.0  542.0    0.0  68755.9  0.0 25.1    0.0   46.4   1 100   0   0   0   0 c7t2d0
    0.0  542.0    0.0  69376.4  0.0 35.0    0.0   64.6   1 100   0   0   0   0 c7t2d0
    0.0  581.0    0.0  74368.0  0.0 30.6    0.0   52.6   1 100   0   0   0   0 c7t2d0
    0.0  567.0    0.0  72574.1  0.0 33.2    0.0   58.6   1 100   0   0   0   0 c7t2d0
    0.0  564.0    0.0  72194.1  0.0 31.1    0.0   55.2   1 100   0   0   0   0 c7t2d0
    0.0  573.0    0.0  73343.5  0.0 33.2    0.0   57.9   1 100   0   0   0   0 c7t2d0
    0.0  536.3    0.0  68640.5  0.0 33.1    0.0   61.7   1 100   0   0   0   0 c7t2d0
    0.0  121.9    0.0  15608.9  0.0  2.7    0.0   22.1   0  22   0   0   0   0 c7t2d0
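Averaging the kw/s column over the steady-state rows (the ones at %b ==
100, once the NVRAM in front of the SSD is full) confirms the ~70MB/s
sequential write speed mentioned above. A quick check, using the figures
from the table:

```python
# kw/s samples (KB/s) for c7t2d0 in the %b == 100 steady state,
# i.e. rows 3..13 of the iostat output above.
steady_kw = [74113.3, 75357.0, 68096.3, 71428.0, 68755.9, 69376.4,
             74368.0, 72574.1, 72194.1, 73343.5, 68640.5]

avg_mb_s = sum(steady_kw) / len(steady_kw) / 1024
print(f"steady-state slog throughput ~ {avg_mb_s:.0f} MB/s")  # ~ 70 MB/s
```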
Again, the slog wrote about double the file size (1022.6MB), and a few
seconds later the data was pushed to primary storage (684.9MB, against
an expectation of 666MB = 533MB/8*10), so again about the right amount
hit the spinning platters.
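The roughly 2x inflation can be read straight off the numbers from the
two tests (a quick check using the slog totals reported above):

```python
# Slog megabytes written versus file size, from the two tests above.
tests = {
    "small (8.4MB file)": (17.0, 8.4),      # ~17MB hit the slog
    "large (533MB file)": (1022.6, 533.0),  # ~1022.6MB hit the slog
}
for name, (slog_mb, file_mb) in tests.items():
    print(f"{name}: slog/file ratio = {slog_mb / file_mb:.2f}")
# Both ratios come out close to 2.0, i.e. roughly double the data size.
```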
Friday, 25 September 2009 1:13:43 PM EST
                 extended device statistics       ---- errors ---
    r/s    w/s   kr/s     kw/s wait actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
    0.0  338.3    0.0  32794.4  0.0 13.7    0.0   40.6   1  47   0   0   0   0 c11t0d0
    0.0  325.3    0.0  31399.8  0.0 13.7    0.0   42.0   1  47   0   0   0   0 c11t1d0
    0.0  339.3    0.0  33273.3  0.0 13.7    0.0   40.3   1  47   0   0   0   0 c11t2d0
    0.0  332.3    0.0  32009.0  0.0 13.7    0.0   41.4   0  47   0   0   0   0 c11t3d0
    0.0  352.3    0.0  34364.0  0.0 13.7    0.0   39.0   1  47   0   0   0   0 c11t4d0
    0.0  355.2    0.0  33788.7  0.0 13.7    0.0   38.6   1  47   0   0   0   0 c11t5d0
    0.0  352.3    0.0  33452.3  0.0 13.8    0.0   39.3   1  47   0   0   0   0 c11t6d0
    0.0  339.3    0.0  32873.5  0.0 13.7    0.0   40.4   1  47   0   0   0   0 c11t7d0
    0.0  337.3    0.0  32889.0  0.0 13.5    0.0   40.0   1  47   0   0   0   0 c11t8d0
    0.0  336.3    0.0  32441.9  0.0 13.7    0.0   40.9   1  47   0   0   0   0 c11t9d0
Friday, 25 September 2009 1:13:44 PM EST
                 extended device statistics       ---- errors ---
    r/s    w/s   kr/s     kw/s wait actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
    0.0  349.7    0.0  35677.0  0.0 16.1    0.0   45.9   0  48   0   0   0   0 c11t0d0
    0.0  367.7    0.0  37078.3  0.0 16.1    0.0   43.8   0  49   0   0   0   0 c11t1d0
    0.0  348.7    0.0  35197.1  0.0 16.3    0.0   46.9   0  49   0   0   0   0 c11t2d0
    0.0  360.7    0.0  36467.7  0.0 15.9    0.0   44.1   0  48   0   0   0   0 c11t3d0
    0.0  342.7    0.0  34103.9  0.0 16.2    0.0   47.2   0  48   0   0   0   0 c11t4d0
    0.0  347.7    0.0  34682.1  0.0 16.0    0.0   46.0   0  48   0   0   0   0 c11t5d0
    0.0  349.7    0.0  35018.3  0.0 16.3    0.0   46.7   0  49   0   0   0   0 c11t6d0
    0.0  353.7    0.0  35600.5  0.0 16.1    0.0   45.6   0  49   0   0   0   0 c11t7d0
    0.0  350.7    0.0  35580.5  0.0 16.2    0.0   46.1   0  49   0   0   0   0 c11t8d0
    0.0  356.7    0.0  36031.0  0.0 15.9    0.0   44.4   0  48   0   0   0   0 c11t9d0
Can anybody explain what is going on with the slog device: why is all
data being shunted via it, and why is about double the data size being
written to it per transaction?
cheers,
James
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss