[ 
https://issues.apache.org/jira/browse/CASSANDRA-15213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17010990#comment-17010990
 ] 

Benedict Elliott Smith edited comment on CASSANDRA-15213 at 1/8/20 8:24 PM:
----------------------------------------------------------------------------

bq. Or are you suggesting increasing the bucket resolution
bq. Do you mean having multiple buckets per offset? 

Right.  I meant one option was to essentially inline the striping by 
introducing, e.g., 4x as many physical buckets as requested.  Each logical 
bucket would then be represented by the sum of its physical buckets.  The 
physical buckets for a given logical bucket would be offset from each other by 
at least two cache lines, and all buckets would ideally be distributed more 
uniformly in memory (e.g. by {{Integer.reverse(idx)}}), so that adjacent 
buckets that are most likely to be used together don't also compete for the 
same cache lines.
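
To make that concrete, here is a rough, purely hypothetical sketch (the class 
and member names are mine, not from any patch): {{STRIPES}} physical counters 
back each logical bucket, a thread picks a stripe from its id, and 
{{Integer.reverse}} scatters logically adjacent buckets across the backing 
array so that neighbouring hot buckets don't share cache lines.

{code}
import java.util.concurrent.atomic.AtomicLongArray;

// Purely illustrative sketch of the inlined striping: STRIPES physical counters per
// logical bucket, with Integer.reverse spreading logically adjacent buckets across
// the backing array so neighbouring hot buckets don't land on the same cache lines.
class StripedBuckets
{
    static final int STRIPES = 4;
    final AtomicLongArray counts;   // physical counters, length rounded up to a power of two
    final int indexBits;            // log2 of counts.length(), for the bit-reversal below

    StripedBuckets(int logicalBuckets)
    {
        int size = Integer.highestOneBit(Math.max(1, logicalBuckets * STRIPES - 1)) << 1;
        indexBits = Integer.numberOfTrailingZeros(size);
        counts = new AtomicLongArray(size);
    }

    // crude stripe choice from the thread id; the real thing might probe like LongAdder
    void update(int bucket)
    {
        int stripe = (int) (Thread.currentThread().getId() & (STRIPES - 1));
        counts.incrementAndGet(physicalIndex(bucket, stripe));
    }

    // a logical bucket is read as the sum of its physical counters
    long sum(int bucket)
    {
        long total = 0;
        for (int stripe = 0 ; stripe < STRIPES ; stripe++)
            total += counts.get(physicalIndex(bucket, stripe));
        return total;
    }

    // reverse the index bits so adjacent logical buckets land far apart in memory
    private int physicalIndex(int bucket, int stripe)
    {
        return Integer.reverse(bucket * STRIPES + stripe) >>> (32 - indexBits);
    }
}
{code}

The two-cache-line offset could be folded into the same index transform; the 
sketch only shows the shape of the write and read paths.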

bq. Additionally, re: your LongAdder reference, are you proposing creating new 
buckets for the same offset under contention (like LongAdder does w/ Cell)?

I meant an alternative approach might be to batch a range of buckets together - 
say 16 or more, to amortise the object overhead - and to perform 
{{LongAdder}}'s inflation of the number of buckets only on detection of 
contention, so that we only gradually increase the number of cells as they are 
needed.  This might have the added benefit of not wasting space on striping for 
buckets that are rarely used, but at the cost of greater complexity, object 
overheads and indirection.
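
Roughly along these lines (an entirely hypothetical sketch; the names and the 
failed-CAS heuristic are mine, standing in for {{LongAdder}}'s contention 
detection), where only the chunk that actually observes contention grows extra 
stripes:

{code}
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicLongArray;
import java.util.concurrent.atomic.AtomicReferenceArray;

// Purely illustrative: buckets are grouped into chunks of 16; each chunk starts with a
// single stripe, and a failed CAS (a stand-in for LongAdder's contention detection)
// adds a stripe for that chunk only, so rarely-updated ranges never pay for striping.
class ChunkedStripes
{
    static final int CHUNK = 16, MAX_STRIPES = 4;
    final AtomicReferenceArray<AtomicLongArray[]> chunks;

    ChunkedStripes(int bucketCount)
    {
        chunks = new AtomicReferenceArray<>((bucketCount + CHUNK - 1) / CHUNK);
        for (int i = 0 ; i < chunks.length() ; i++)
            chunks.set(i, new AtomicLongArray[] { new AtomicLongArray(CHUNK) });
    }

    void update(int bucket)
    {
        AtomicLongArray[] stripes = chunks.get(bucket / CHUNK);
        AtomicLongArray counts = stripes[(int) (Thread.currentThread().getId() % stripes.length)];
        int i = bucket % CHUNK;
        long cur = counts.get(i);
        if (!counts.compareAndSet(i, cur, cur + 1))
        {
            counts.incrementAndGet(i);          // still record the update we lost the race on
            if (stripes.length < MAX_STRIPES)   // then inflate this chunk, and only this chunk
            {
                AtomicLongArray[] grown = Arrays.copyOf(stripes, stripes.length + 1);
                grown[stripes.length] = new AtomicLongArray(CHUNK);
                chunks.compareAndSet(bucket / CHUNK, stripes, grown);
            }
        }
    }

    // a logical bucket is read as the sum across all of its chunk's stripes
    long sum(int bucket)
    {
        long total = 0;
        for (AtomicLongArray counts : chunks.get(bucket / CHUNK))
            total += counts.get(bucket % CHUNK);
        return total;
    }
}
{code}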

bq. Regarding the binary search suggestion, are you suggesting using the 
approximation to replace the maximum index

I meant to use it as the floor for a linear search for the correct position, so 
that it would ordinarily take 1 or 2 comparisons, but typically only one cache 
miss.

Specifically, I meant to use an integer approximation, along the lines of...

{code}
// Fast approximation of log base 1.2 (the histogram's bucket growth factor): the
// integer part of log2 comes from the position of the highest set bit, and the
// fractional part from a small precomputed table indexed by the next TABLE_BITS bits.
private static final int TABLE_BITS = 4;
private static final int TABLE_MASK = -1 >>> (32 - TABLE_BITS);
private static final float[] LOG2_TABLE = computeTable(TABLE_BITS);
private static final float log2_12 = (float) slowLog2(1.2d);
private static final float log2_12_recp = (float) (1d / slowLog2(1.2d));

// precompute log2(1 + i/2^bits) for each table entry
private static float[] computeTable(int bits)
{
    float[] table = new float[1 << bits];
    for (int i = 1 ; i < 1 << bits ; ++i)
        table[i] = (float) slowLog2(ratio(i, bits));
    return table;
}

// log base 1.2, via the identity log1.2(v) = log2(v) / log2(1.2)
private static float fastLog12(long v)
{
    return fastLog2(v) * log2_12_recp;
}

private static float fastLog2(long v)
{
    if (v == 0) return 0;
    // integer part: position of the highest set bit
    int highestBitPosition = 63 - Long.numberOfLeadingZeros(v);
    // bring the TABLE_BITS bits below the leading bit into the low-order positions
    v = Long.rotateRight(v, highestBitPosition - TABLE_BITS);
    int index = (int) (v & TABLE_MASK);
    // fractional part: table lookup on that mantissa prefix
    float result = LOG2_TABLE[index];
    result += highestBitPosition;
    return result;
}

private static double slowLog2(double v)
{
    return Math.log(v) / Math.log(2);
}

private static double slowLog12(double v)
{
    return (Math.log(v) / Math.log(2)) * log2_12_recp;
}

// 1 + i/2^bits, built directly from the float bit pattern (exponent 0, mantissa i)
private static double ratio(int i, int bits)
{
    return Float.intBitsToFloat((127 << 23) | (i << (23 - bits)));
}
{code}
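
The approximation would then seed the search; a hypothetical usage sketch 
({{bucketOffsets}} is assumed to be the sorted array of bucket boundaries, and 
{{findBucket}} is just an illustrative name):

{code}
// Start from the estimated index, then settle the exact position with a short
// linear scan.  The estimate usually lands at or just below the correct bucket,
// so the forward scan is 1 or 2 steps; the backward loop only guards against a
// slight overshoot.
private static int findBucket(long[] bucketOffsets, long value)
{
    int i = Math.max(0, Math.min(bucketOffsets.length - 1, (int) fastLog12(value)));
    while (i > 0 && bucketOffsets[i - 1] >= value)
        --i;
    while (i < bucketOffsets.length && bucketOffsets[i] < value)
        ++i;
    return i;
}
{code}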






> DecayingEstimatedHistogramReservoir Inefficiencies
> --------------------------------------------------
>
>                 Key: CASSANDRA-15213
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15213
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Observability/Metrics
>            Reporter: Benedict Elliott Smith
>            Assignee: Jordan West
>            Priority: Normal
>             Fix For: 4.0-beta
>
>
> * {{LongAdder}} introduced to trunk consumes 9MiB of heap without user 
> schemas, and this will grow significantly under contention and user schemas 
> with many tables.  This is because {{LongAdder}} is a very heavy class 
> designed for single contended values.  
>  ** This can likely be improved significantly, without significant loss of 
> performance in the contended case, by simply increasing the size of our 
> primitive backing array and providing multiple buckets, with each thread 
> picking a bucket to increment, or simply multiple backing arrays.  Probably a 
> better way still to do this would be to introduce some competition detection 
> to the update, much like {{LongAdder}} utilises, that increases the number of 
> backing arrays under competition.
>  ** To save memory this approach could partition the space into chunks that 
> are likely to be updated together, so that we do not need to duplicate the 
> entire array under competition.
>  * Similarly, binary search is costly and a measurable cost as a share of the 
> new networking work (without filtering it was > 10% of the CPU used overall). 
>  We can compute an approximation floor(log2 n / log2 1.2) extremely cheaply, 
> to save the random memory access costs.


