On 1/27/21 5:35 AM, [email protected] wrote:
From: Filipe Manana <[email protected]>

Often an fsync needs to fall back to a transaction commit for several
reasons (to ensure consistency after a power failure, because a new block
group was allocated, or because a temporary error such as ENOMEM or ENOSPC
happened).

In that case the log is marked as needing a full commit and any concurrent
tasks attempting to log inodes or commit the log will also fall back to the
transaction commit. When this happens they all wait for the task that first
started the transaction commit to finish it - however they wait until the
full transaction commit happens, which is not needed, as they only need to
wait for the superblocks to be persisted and not for unpinning all the
extents pinned during the transaction's lifetime. Even for short lived
transactions the number of pinned extents can be a few thousand, and
unpinning them can take a significant amount of time - for dbench workloads
I have observed up to 4~5 milliseconds spent unpinning extents in the worst
cases, with the number of pinned extents between 2 and 3 thousand.

So allow fsync tasks to skip waiting for the unpinning of extents when
they call btrfs_commit_transaction() and they were not the task that
started the transaction commit (that task still has to wait for it; the
alternative would be to offload the transaction commit to another task so
that it could avoid waiting for the extent unpinning, or to offload the
extent unpinning to another task).
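
To make the wait ordering concrete, here is a minimal userspace sketch
(pthreads, not the actual btrfs code) of the idea: the committing task
publishes an intermediate "superblocks persisted" state before it does the
slow extent unpinning, and concurrent fsync tasks only wait for that earlier
state. All names below (trans_state, fsync_task(), etc.) are invented for
illustration and are not the kernel's API:

  /*
   * Illustrative only: a committing thread signals "super committed"
   * before doing the slow unpinning work, and fsync-style waiters wake
   * up at that point instead of waiting for full completion.
   * Build with: gcc -o sketch sketch.c -lpthread
   */
  #include <pthread.h>
  #include <stdio.h>
  #include <unistd.h>

  enum trans_state {
      TRANS_RUNNING,
      TRANS_SUPER_COMMITTED,   /* superblocks persisted: enough for fsync */
      TRANS_COMPLETED,         /* extents unpinned, commit fully done */
  };

  static enum trans_state state = TRANS_RUNNING;
  static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
  static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;

  static void set_state(enum trans_state new_state)
  {
      pthread_mutex_lock(&lock);
      state = new_state;
      pthread_cond_broadcast(&cond);
      pthread_mutex_unlock(&lock);
  }

  /* What a concurrent fsync needs: wait only until the supers are on disk. */
  static void *fsync_task(void *arg)
  {
      pthread_mutex_lock(&lock);
      while (state < TRANS_SUPER_COMMITTED)
          pthread_cond_wait(&cond, &lock);
      pthread_mutex_unlock(&lock);
      printf("fsync task %ld: data safe, returning before unpinning ends\n",
             (long)arg);
      return NULL;
  }

  int main(void)
  {
      pthread_t tasks[4];
      long i;

      for (i = 0; i < 4; i++)
          pthread_create(&tasks[i], NULL, fsync_task, (void *)i);

      usleep(100 * 1000);               /* pretend to write tree roots + supers */
      set_state(TRANS_SUPER_COMMITTED); /* fsync waiters can proceed now */

      usleep(5 * 1000);                 /* pretend to unpin ~2-3k extents */
      set_state(TRANS_COMPLETED);

      for (i = 0; i < 4; i++)
          pthread_join(tasks[i], NULL);
      printf("transaction commit fully completed\n");
      return 0;
  }

The actual patch does the equivalent inside btrfs_commit_transaction(): the
task that started the commit still waits for the unpinning, while the other
fsync tasks return once the superblocks have been persisted.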

This patch is part of a patchset comprised of the following patches:

   btrfs: remove unnecessary directory inode item update when deleting dir entry
   btrfs: stop setting nbytes when filling inode item for logging
   btrfs: avoid logging new ancestor inodes when logging new inode
   btrfs: skip logging directories already logged when logging all parents
   btrfs: skip logging inodes already logged when logging new entries
   btrfs: remove unnecessary check_parent_dirs_for_sync()
   btrfs: make concurrent fsyncs wait less when waiting for a transaction commit

After applying the entire patchset, dbench shows improvements in both
throughput and latency. The script used to measure it is the following:

   $ cat dbench-test.sh
   #!/bin/bash

   DEV=/dev/sdk
   MNT=/mnt/sdk
   MOUNT_OPTIONS="-o ssd"
   MKFS_OPTIONS="-m single -d single"

   echo "performance" | tee 
/sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

   umount $DEV &> /dev/null
   mkfs.btrfs -f $MKFS_OPTIONS $DEV
   mount $MOUNT_OPTIONS $DEV $MNT

   dbench -D $MNT -t 300 64

   umount $MNT

The test was run on a physical machine with 12 cores (Intel Core i7), 64GB
of RAM, using an NVMe device and a non-debug kernel configuration (Debian's
default configuration).

Before applying patchset, 32 clients:

  Operation      Count    AvgLat    MaxLat
  ----------------------------------------
  NTCreateX    9627107     0.153    61.938
  Close        7072076     0.001     3.175
  Rename        407633     1.222    44.439
  Unlink       1943895     0.658    44.440
  Deltree          256    17.339   110.891
  Mkdir            128     0.003     0.009
  Qpathinfo    8725406     0.064    17.850
  Qfileinfo    1529516     0.001     2.188
  Qfsinfo      1599884     0.002     1.457
  Sfileinfo     784200     0.005     3.562
  Find         3373513     0.411    30.312
  WriteX       4802132     0.053    29.054
  ReadX        15089959     0.002     5.801
  LockX          31344     0.002     0.425
  UnlockX        31344     0.001     0.173
  Flush         674724     5.952   341.830

Throughput 1008.02 MB/sec  32 clients  32 procs  max_latency=341.833 ms

After applying patchset, 32 clients:

  Operation      Count    AvgLat    MaxLat
  ----------------------------------------
  NTCreateX    9931568     0.111    25.597
  Close        7295730     0.001     2.171
  Rename        420549     0.982    49.714
  Unlink       2005366     0.497    39.015
  Deltree          256    11.149    89.242
  Mkdir            128     0.002     0.014
  Qpathinfo    9001863     0.049    20.761
  Qfileinfo    1577730     0.001     2.546
  Qfsinfo      1650508     0.002     3.531
  Sfileinfo     809031     0.005     5.846
  Find         3480259     0.309    23.977
  WriteX       4952505     0.043    41.283
  ReadX        15568127     0.002     5.476
  LockX          32338     0.002     0.978
  UnlockX        32338     0.001     2.032
  Flush         696017     7.485   228.835

Throughput 1049.91 MB/sec  32 clients  32 procs  max_latency=228.847 ms

  --> +4.2% throughput, -33.1% max latency

Before applying patchset, 64 clients:

  Operation      Count    AvgLat    MaxLat
  ----------------------------------------
  NTCreateX    8956748     0.342   108.312
  Close        6579660     0.001     3.823
  Rename        379209     2.396    81.897
  Unlink       1808625     1.108   131.148
  Deltree          256    25.632   172.176
  Mkdir            128     0.003     0.018
  Qpathinfo    8117615     0.131    55.916
  Qfileinfo    1423495     0.001     2.635
  Qfsinfo      1488496     0.002     5.412
  Sfileinfo     729472     0.007     8.643
  Find         3138598     0.855    78.321
  WriteX       4470783     0.102    79.442
  ReadX        14038139     0.002     7.578
  LockX          29158     0.002     0.844
  UnlockX        29158     0.001     0.567
  Flush         627746    14.168   506.151

Throughput 924.738 MB/sec  64 clients  64 procs  max_latency=506.154 ms

After applying patchset, 64 clients:

  Operation      Count    AvgLat    MaxLat
  ----------------------------------------
  NTCreateX    9069003     0.303    43.193
  Close        6662328     0.001     3.888
  Rename        383976     2.194    46.418
  Unlink       1831080     1.022    43.873
  Deltree          256    24.037   155.763
  Mkdir            128     0.002     0.005
  Qpathinfo    8219173     0.137    30.233
  Qfileinfo    1441203     0.001     3.204
  Qfsinfo      1507092     0.002     4.055
  Sfileinfo     738775     0.006     5.431
  Find         3177874     0.936    38.170
  WriteX       4526152     0.084    39.518
  ReadX        14213562     0.002    24.760
  LockX          29522     0.002     1.221
  UnlockX        29522     0.001     0.694
  Flush         635652    14.358   422.039

Throughput 990.13 MB/sec  64 clients  64 procs  max_latency=422.043 ms

  --> +7.1% throughput, -16.6% max latency

Signed-off-by: Filipe Manana <[email protected]>

Funnily enough I've got a patch to make the unpinning async, as it drastically affects our WhatsApp workload. Once that lands maybe we can undo this bit.

Reviewed-by: Josef Bacik <[email protected]>

Thanks,

Josef
