Re: [zfs-discuss] Very poor small-block random write performance

Jim Klimov Sat, 21 Jul 2012 12:01:37 -0700

2012-07-20 5:11, Bob Friesenhahn wrote:

On Fri, 20 Jul 2012, Jim Klimov wrote:


Zfs data block sizes are fixed size!  Only tail blocks are shorter.


This is the part I am not sure is either implied by the docs
or confirmed by my practice. But maybe I've missed something...


This is something that I am quite certain of. :-)

When doing a random write inside a file, the unit of COW is the zfs
filesystem blocksize.


Well, apparently I was wrong, and Bob was right :)

I ran a simple test like this:

# zfs create -o compression=off -o dedup=off -o copies=1 rpool/test

This should rule out complex storage options for user-data
bytes.


# cd /rpool/test/ && touch file && ls -lai
total 9
         4 drwxr-xr-x   2 root     root           3 Jul 21 22:38 .
         4 drwxr-xr-x   5 root     root           5 Jul 21 22:37 ..
         8 -rw-r--r--   1 root     root           0 Jul 21 22:38 file

So the file's inode number is 8 (above). This is used for the zdb inspections below.



# /usr/gnu/bin/dd if=/dev/random bs=1k count=1 >> file; sync; \
  zdb -dddddddd rpool/test 8 | grep ' L. '

I repeated the last command a few times. Apparently (as Bob
wrote me off-list), a change to the tail block causes the whole
block to be read from disk, the new bytes appended, and the
result written out, up to the dataset recordsize. Thus all
intermediate blocks of a file should consume full recordsizes,
even if the file was appended to in small portions spread over
several TXGs.
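The read-merge-write behaviour described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not how ZFS accounts for I/O: it assumes a 128 KiB recordsize, assumes every append lands in its own TXG, ignores sector rounding and metadata, and treats the tail block as exactly as long as the data it holds. The function name `append_write_cost` is made up for this illustration.

```python
RECORDSIZE = 128 * 1024  # assumed dataset recordsize (the ZFS default)


def append_write_cost(chunk, count, recordsize=RECORDSIZE):
    """Bytes of file data written when `chunk` bytes are appended
    `count` times, each append synced separately.

    Simplified model: every block touched by an append is rewritten
    in full, and the tail block is as long as the data it holds.
    """
    if chunk <= 0:
        raise ValueError("chunk must be positive")
    size = 0
    written = 0
    for _ in range(count):
        new_size = size + chunk
        first = size // recordsize            # block holding the old end of file
        last = (new_size - 1) // recordsize   # block holding the new end of file
        for blk in range(first, last + 1):
            start = blk * recordsize
            written += min(recordsize, new_size - start)
        size = new_size
    return written


# Three 1 KiB appends rewrite the tail block at 1, 2 and 3 KiB.
print(append_write_cost(1024, 3))        # 6144
# Appends of exactly one record never rewrite an older block.
print(append_write_cost(RECORDSIZE, 2))  # 262144
```

In this model, small appends cost more than the bytes appended because the growing tail block is rewritten each time, while the file still ends up as full records plus one tail block.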

Replacing kilobytes at locations spanning one or two blocks
also caused reallocations and rewrites of zfs recordsized
pieces:

# /usr/gnu/bin/dd if=/dev/random of=/rpool/test/file bs=1k \
  seek=12 count=10 conv=noerror,notrunc; sync

# /usr/gnu/bin/dd if=/dev/random of=/rpool/test/file bs=1k \
  seek=125 count=10 conv=noerror,notrunc; sync

# zdb -dddddddd rpool/test 8 | grep ' L. '

Dataset rpool/test [ZPL], ID 8412, cr_txg 2110110, 289K, 8 objects, rootbp DVA[0]=<0:a69b82a00:200> DVA[1]=<0:264111e00:200> [L0 DMU objset] fletcher4 lzjb LE contiguous unique double size=800L/200P birth=2110309L/2110309P fill=8 cksum=1373a7e215:68dacd0d409:12c745c671d52:25e0616461e20c0

                  L1 0:a69b80200:400 0:263f1dc00:400 4000L/400P F=2 B=2110309/2110309

               0  L0 0:a61bcb200:20000 20000L/20000P F=1 B=2110309/2110309
           20000  L0 0:a61b87a00:20000 20000L/20000P F=1 B=2110304/2110304
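Which records the two overwrites touch can be worked out with plain arithmetic. The sketch below assumes a recordsize of 128 KiB (0x20000, consistent with the 20000L block sizes in the zdb output) and that dd's seek=N with bs=1k means a byte offset of N KiB. The helper name `touched_records` is hypothetical.

```python
RECORDSIZE = 128 * 1024  # assumed recordsize; equals 0x20000
KIB = 1024


def touched_records(offset, length, recordsize=RECORDSIZE):
    """Return the indices of the recordsize-aligned blocks covered by
    a write of `length` bytes starting at byte `offset`."""
    if length <= 0:
        return []
    first = offset // recordsize
    last = (offset + length - 1) // recordsize
    return list(range(first, last + 1))


# dd bs=1k seek=12 count=10: bytes 12288..22527, inside the first record
print(touched_records(12 * KIB, 10 * KIB))   # [0]
# dd bs=1k seek=125 count=10: bytes 128000..138239, crossing offset 0x20000
print(touched_records(125 * KIB, 10 * KIB))  # [0, 1]
```

Under these assumptions the first overwrite stays inside record 0, while the second crosses the boundary at offset 0x20000 and touches both records, which matches the two L0 entries at offsets 0 and 20000.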


During this quick test I did not manage to craft a case that
would inflate a file in the middle without touching its other
blocks (a text editor that saves the whole file does not count).
So I could not check whether ZFS can "insert" smaller blocks in
the middle of an existing file, or whether it would reallocate
the other blocks to fit the set recordsize.

For generic filesystem uses (append, replace 1:1), at least,
Bob's assessment is right: zfs stores recordsized blocks
and one possibly smaller tail block, not a series of randomly
sized blocks as I implied.

I can imagine situations, such as heavily congested systems,
where zfs might cut corners to get dirty bytes out to disk
faster and skip the read-merge-write of tail blocks. But even
if this is implemented at all, it should be a rare condition.

//Jim Klimov

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
