Re: [lustre-discuss] stripe count recommendation, and proposal for auto-stripe tool

Patrick Farrell Thu, 19 May 2016 08:57:27 -0700

Ah, of course - We're only talking about restriping existing stuff.


Yes, that's just fine - No lock conflicts on reading.  Looks good to me.

This is probably also something we'd want to allow via HSM. Not sure how the current patches interact with that (haven't looked).


- Patrick

On 05/19/2016 10:53 AM, Nathan Dauchy - NOAA Affiliate wrote:

Patrick,

You bring up an interesting point on read vs. write performance. We can't use lfs_migrate to control the stripe count used for writes (obviously), so that is left up to the application developer or at least the user to intelligently place shared access files in a directory with wider striping. Restriping a file with lfs_migrate could change *read* performance characteristics, so there is indeed some risk there... but your work implies that is not too bad. If we only restripe files that are "old", then the likelihood that they will be read again goes way down, and balancing capacity used plays a bigger factor. Bottom line is that I think restriping has more potential for upsides than down. :)
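A minimal sketch of the "only restripe old files" idea, assuming that "old" means the data was last modified more than a given number of days ago. The helper name `list_old_files`, the 90-day threshold, the path, and the stripe count in the usage comment are all hypothetical; the `-y`, `-0` and `-c` options are the ones listed in the lfs_migrate usage text quoted further down.

```shell
#!/usr/bin/env bash
# list_old_files DIR DAYS
# Print the regular files under DIR whose data was last modified more than
# DAYS days ago, NUL-separated so that odd file names survive the pipe.
# (Assumption: modification time is the measure of "old"; a site may
# prefer access time if it is tracked reliably.)
list_old_files() {
    local dir=$1 days=$2
    find "$dir" -type f -mtime +"$days" -print0
}

# Hypothetical usage (path and stripe count are placeholders):
#   list_old_files /lustre/project 90 | lfs_migrate -y -0 -c 4
```

Feeding the list through stdin keeps the selection policy separate from the migration itself, so the age threshold can be tuned without touching lfs_migrate.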


Thanks,
Nathan

On Wed, May 18, 2016 at 1:22 PM, Patrick Farrell <p...@cray.com<mailto:p...@cray.com>> wrote:


    Nathan,

    This *is* excellent fodder for discussion.

    A few thoughts from a developer perspective.  When you stripe a
    file to multiple OSTs, you're spreading the data out across
    multiple targets, which (to my mind) has two purposes:
    1) More even space usage across OSTs (mostly relevant for *really*
    big files, since in general, singly striped files are distributed
    across OSTs anyway)
    2) Better bandwidth/parallelism for accesses to the file.

    The first one lends itself well to a file size based heuristic,
    but I'm not sure the second one does. That's more about access
    patterns.  I'm not sure that you see much bandwidth benefit from
    striping with a single client, at least as long as an individual
    OST is fast relative to a client (increasingly common, I think,
    with flash and larger RAID arrays).  So then, whatever the file
    size, if it's accessed from one client, it should probably be
    single striped.

    Also, for shared files, client count relative to stripe count has
    a huge impact on write performance. Assuming strided I/O patterns,
    anything more than 1 client per stripe/OST is actually worse than
    1 client.  (See my lock ahead presentation at LUG'15 for more on
    this.)  Read performance doesn't share this weirdness, though.

    All that's to say that for case 2 above, at least for writing,
    it's access pattern/access parallelism, not size, which matters.
    I'm sure there's some correlation between file size and how
    parallel the access pattern is, but it might be very loose, and at
    least write performance doesn't scale linearly with stripe size.
    Instead, the behavior is complex.


    So in order to pick an ideal striping with case 2 in mind, you
    really need to understand the application access pattern.  I can't
    see another way to do that goal justice.  (The Lustre ADIO in the
    MPI I/O library does this, partly by controlling the I/O pattern
    through I/O aggregation for collective I/Os.)

    So I think your tool can definitely help with case 1, not so sure
    about case 2.

    - Patrick

    On 05/18/2016 12:22 PM, Nathan Dauchy - NOAA Affiliate wrote:

    Greetings All,

    I'm looking for your experience and perhaps some lively
    discussion regarding "best practices" for choosing a file stripe
    count.  The Lustre manual has good tips on "Choosing a Stripe
    Size", and in practice the default 1M rarely causes problems on
    our systems. For Stripe Count, on the other hand, it is far more
    difficult to choose a single value that is efficient for a
    general-purpose, multi-use site-wide file system.

    Since there is the "increased overhead" of striping, and weather
    applications do unfortunately write MANY tiny files, we usually
    keep the filesystem default stripe count at 1.  Unfortunately,
    there are several users who then write very large and
    shared-access files with that default.  I would like to be able
    to tell them to restripe... but without digging into the specific
    application and access pattern it is hard to know what count to
    recommend.  Plus there is the "stripe these but not those"
    confusion... it is common for users to have a few very large data
    files and many small log or output image files in the SAME directory.

    What do you all recommend as a reasonable rule of thumb that
    works for "most" users' needs, where stripe count can be
    determined based only on static data attributes (such as file
    size)?  I have heard a "stripe per GB" idea, but some have said
    that escalates to too many stripes too fast.  ORNL has a
    knowledge base article that says use a stripe count of "File size
    / 100 GB", but does that make sense for smaller, non-DOE sites?
    Would stripe count = Log2(size_in_GB)+1 be more generally
    reasonable?  For a 1 TB file, that actually works out to be
    similar to ORNL, only gets there more gradually:
    https://www.olcf.ornl.gov/kb_articles/lustre-basics/#Stripe_Count
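To make the comparison concrete, here is a rough sketch that tabulates the three rules of thumb mentioned above for a few file sizes. The function names are made up for illustration, sizes are whole GB, the logarithm is rounded down, and the floor of one stripe on the "size / 100 GB" rule is an assumption rather than something stated in the ORNL article.

```shell
#!/usr/bin/env bash
# Compare three stripe-count rules of thumb using integer math only.

# floor(log2(n)) for an integer n >= 1
ilog2() {
    local n=$1 k=0
    while [ "$n" -gt 1 ]; do
        n=$(( n / 2 ))
        k=$(( k + 1 ))
    done
    echo "$k"
}

# Rule A: one stripe per GB
count_per_gb() { echo "$1"; }

# Rule B: file size / 100 GB (assumed minimum of 1 stripe)
count_per_100gb() {
    local c=$(( $1 / 100 ))
    if [ "$c" -lt 1 ]; then c=1; fi
    echo "$c"
}

# Rule C: floor(log2(size_in_GB)) + 1
count_log2() { echo $(( $(ilog2 "$1") + 1 )); }

printf '%8s %8s %10s %8s\n' "size_GB" "per_GB" "per_100GB" "log2+1"
for gb in 1 8 64 512 1024; do
    printf '%8d %8d %10d %8d\n' "$gb" \
        "$(count_per_gb "$gb")" "$(count_per_100gb "$gb")" "$(count_log2 "$gb")"
done
```

For a 1024 GB file the last two rules give 10 and 11 stripes respectively, while the logarithmic rule already reaches 7 stripes at 64 GB, where the "size / 100 GB" rule is still at 1.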

    Ideally, I would like to have a tool to give the users and say
    "go restripe your directory with this command" and it will do the
    right thing in 90% of cases.  See the rough patch to lfs_migrate
    (included below) which should help explain what I'm thinking.
    Probably there are more efficient ways of doing things, but I
    have tested it lightly and it works as a proof-of-concept.

    With a good programmatic rule of thumb, we (as a Lustre
    community!) can eventually work with application developers to
    embed the stripe count selection into their code and get things
    at least closer to right up front.  Even if trial and error is
    involved to find the optimal setting, at least the rule of thumb
    can be a _starting_point_ for the users, and they can tweak it
    from there based on application, model, scale, dataset, etc.

    Thinking farther down the road, with progressive file layout,
    what algorithm will be used as the default?  If Lustre gets to
    the point where it can rebalance OST capacity behind the scenes,
    could it also make some intelligent choice about restriping very
    large files to spread out load and better balance capacity?
     (Would that mean we need a bit set on the file to flag whether
    the stripe info was set specifically by the user or automatically
    by Lustre tools or it was just using the system default?)  Can
    the filesystem track concurrent access to a file, and perhaps
    migrate the file and adjust stripe count based on number of
    active clients?

    I appreciate any and all suggestions, clarifying questions,
    heckles, etc.  I know this is a lot of questions, and I certainly
    don't expect definitive answers on all of them, but I hope it is
    at least food for thought and discussion! :)

    Thanks,
    Nathan


    --- lfs_migrate-2.7.1 2016-05-13 12:46:06.828032000 +0000
    +++ lfs_migrate.auto-count 2016-05-17 21:37:19.036589000 +0000
    @@ -21,8 +21,10 @@
     usage() {
         cat -- <<USAGE 1>&2
    -usage: lfs_migrate [-c <stripe_count>] [-h] [-l] [-n] [-q] [-R] [-s] [-y] [-0]
    +usage: lfs_migrate [-A] [-c <stripe_count>] [-h] [-l] [-n] [-q] [-R] [-s] [-v] [-y] [-0]
                        [file|dir ...]
    +    -A restripe file using an automatically selected stripe count
    +       currently Stripe Count = Log2(size_in_GB) + 1
         -c <stripe_count>
            restripe file using the specified stripe count
         -h show this usage message
    @@ -31,11 +33,11 @@
         -q run quietly (don't print filenames or status)
         -R restripe file using default directory striping
         -s skip file data comparison after migrate
    +    -v be verbose and print information about each file
         -y answer 'y' to usage question
         -0 input file names on stdin are separated by a null character
    -The -c <stripe_count> option may not be specified at the same time as
    -the -R option.
    +Only one of the '-A', '-c', or '-R' options may be specified at a time.
     If a directory is an argument, all files in the directory are migrated.
     If no file/directory is given, the file list is read from standard input.
    @@ -48,15 +50,19 @@
     OPT_CHECK=y
     OPT_STRIPE_COUNT=""
    +OPT_AUTOSTRIPE=""
    +OPT_VERBOSE=""
    -while getopts "c:hlnqRsy0" opt $*; do
    +while getopts "Ac:hlnqRsvy0" opt $*; do
         case $opt in
    +A) OPT_AUTOSTRIPE=y;;
    c) OPT_STRIPE_COUNT=$OPTARG;;
    l) OPT_NLINK=y;;
    n) OPT_DRYRUN=n; OPT_YES=y;;
    q) ECHO=:;;
    R) OPT_RESTRIPE=y;;
    s) OPT_CHECK="";;
    +v) OPT_VERBOSE=y;;
    y) OPT_YES=y;;
    0) OPT_NULL=y;;
    h|\?) usage;;
    @@ -69,6 +75,16 @@
    echo "$(basename $0) error: The -c <stripe_count> option may not" 1>&2
    echo "be specified at the same time as the -R option." 1>&2
    exit 1
    +elif [ "$OPT_STRIPE_COUNT" -a "$OPT_AUTOSTRIPE" ]; then
    +echo ""
    +echo "$(basename $0) error: The -c <stripe_count> option may not" 1>&2
    +echo "be specified at the same time as the -A option." 1>&2
    +exit 1
    +elif [ "$OPT_AUTOSTRIPE" -a "$OPT_RESTRIPE" ]; then
    +echo ""
    +echo "$(basename $0) error: The -A option may not be specified at" 1>&2
    +echo "the same time as the -R option." 1>&2
    +exit 1
     fi
     if [ -z "$OPT_YES" ]; then
    @@ -107,7 +123,7 @@
    $ECHO -n "$OLDNAME: "
    # avoid duplicate stat if possible
    -TYPE_LINK=($(LANG=C stat -c "%h %F" "$OLDNAME" || true))
    +TYPE_LINK=($(LANG=C stat -c "%h %F %s" "$OLDNAME" || true))
    # skip non-regular files, since they don't have any objects
    # and there is no point in trying to migrate them.
    @@ -127,11 +143,6 @@
    continue
    fi
    -if [ "$OPT_DRYRUN" ]; then
    -echo -e "dry run, skipped"
    -continue
    -fi
    -
    if [ "$OPT_RESTRIPE" ]; then
    UNLINK=""
    else
    @@ -140,16 +151,43 @@
    # then we don't need to do this getstripe/mktemp stuff.
    UNLINK="-u"
    -[ "$OPT_STRIPE_COUNT" ] && COUNT=$OPT_STRIPE_COUNT ||
    -COUNT=$($LFS getstripe -c "$OLDNAME" \
    -2> /dev/null)
    SIZE=$($LFS getstripe $LFS_SIZE_OPT "$OLDNAME" \
          2> /dev/null)
    +if [ "$OPT_AUTOSTRIPE" ]; then
    +FILE_SIZE=${TYPE_LINK[3]}
    +# (math in bash is dumb, so depend on common tools, and there are options for that...)
    +# Stripe Count = Log2(size_in_GB)
    +#COUNT=$(echo $FILE_SIZE | awk '{printf "%.0f\n",log($1/1024/1024/1024)/log(2)}')
    +#COUNT=$(printf "%.0f\n" $(echo "l($FILE_SIZE/1024/1024/1024) / l(2)" | bc -l))
    +COUNT=$(echo "l($FILE_SIZE/1024/1024/1024) / l(2) + 1" | bc -l | cut -d . -f 1)
    +# Stripe Count = size_in_GB
    +#COUNT=$(echo "scale=0; $FILE_SIZE/1024/1024/1024" | bc -l | cut -d . -f 1)
    +[ "$COUNT" -lt 1 ] && COUNT=1
    +# (does it make sense to skip the file if old
    +# and new stripe count are identical?)
    +else
    +[ "$OPT_STRIPE_COUNT" ] && COUNT=$OPT_STRIPE_COUNT ||
    +COUNT=$($LFS getstripe -c "$OLDNAME" \
    +2> /dev/null)
    +fi
    [ -z "$COUNT" -o -z "$SIZE" ] && UNLINK=""
    -SIZE=${LFS_SIZE_OPT}${SIZE}
    fi
    +if [ "$OPT_DRYRUN" ]; then
    +if [ "$OPT_VERBOSE" ]; then
    +echo -e "dry run, would use count=${COUNT} size=${SIZE}"
    +else
    +echo -e "dry run, skipped"
    +fi
    +continue
    +fi
    +if [ "$OPT_VERBOSE" ]; then
    +echo -n "(count=${COUNT} size=${SIZE}) "
    +fi
    +
    +[ "$SIZE" ] && SIZE=${LFS_SIZE_OPT}${SIZE}
    +
    # first try to migrate inside lustre
    # if failed go back to old rsync mode
    if [[ $RSYNC_MODE == false ]]; then
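One possible alternative to the `bc` pipeline in the patch, sketched here under stated assumptions rather than as a tested replacement: the same Log2(size_in_GB) + 1 rule can be computed with bash integer arithmetic alone. This also sidesteps a likely edge case in the `bc` version: `bc` prints values between -1 and 1 without a leading zero (for example `.58`), so for files between roughly 0.25 GB and 1 GB the `cut -d . -f 1` step yields an empty string or a bare `-`, and the following `-lt` test then fails instead of clamping to 1. The function name `auto_stripe_count` is hypothetical, and no upper limit (such as the number of available OSTs) is applied.

```shell
#!/usr/bin/env bash
# auto_stripe_count BYTES
# Print floor(log2(size_in_GB)) + 1, with a minimum of 1 stripe.
# Uses integer math only, so sub-GB files simply fall through to 1.
auto_stripe_count() {
    local bytes=$1
    local gb=$(( bytes / 1024 / 1024 / 1024 ))   # whole GB, rounded down
    local count=1
    while [ "$gb" -gt 1 ]; do
        gb=$(( gb / 2 ))
        count=$(( count + 1 ))
    done
    echo "$count"
}

# Example: a 1 TB (1024 GB) file
auto_stripe_count 1099511627776
```

Rounding the size down to whole GB before taking the logarithm gives the same result as truncating the real-valued logarithm for any file of 1 GB or more, so the stripe counts should match what the patch computes in that range.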



    _______________________________________________
    lustre-discuss mailing list
    lustre-discuss@lists.lustre.org  <mailto:lustre-discuss@lists.lustre.org>
    http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org



