2017-10-06 21:24 GMT+03:00 David Sterba <dste...@suse.cz>:
> On Thu, Oct 05, 2017 at 01:07:26PM +0300, Timofey Titovets wrote:
>> The byte distribution check in the heuristic will filter edge data
>> cases and sometimes fails to classify the input data.
>>
>> Let's fix that by adding a Shannon entropy calculation,
>> which will cover classification of most other data types.
>>
>> As Shannon entropy needs log2 with some precision to work,
>> I precalculate a table for our max sample size (8KiB).
>>
>> The Shannon entropy formula is slightly changed to avoid signed
>> numbers and divisions.
>>
>> Signed-off-by: Timofey Titovets <nefelim...@gmail.com>
>> ---
>>  fs/btrfs/compression.c | 314 ++++++++++++++++++++++++++++++++++++++++++++++++-
>>  1 file changed, 313 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
>> index 0060bc4ae5f2..b002ee980243 100644
>> --- a/fs/btrfs/compression.c
>> +++ b/fs/btrfs/compression.c
>> @@ -1225,6 +1225,290 @@ int btrfs_decompress_buf2page(const char *buf, unsigned long buf_start,
>>  }
>>
>>
>> + /*
>> +  * Precalculated log2 values for the 0 - 8193 range.
>> +  * The returned data is shifted left by 4 bits
>> +  * to improve precision without floating point.
>> +  *
>> +  * Used only in the Shannon entropy calculation for the heuristic.
>> +  *
>> +  * Only the first 128 elements are precalculated to save memory.
>> +  */
>> +
>> +#define LOG2_RET_SHIFT (1 << 4)
>> +
>> +static int log2_lshift4(uint16_t x)
>> +{
>> +     /*
>> +      * Predefined array for first 128 values
>> +      * Python generator example
>> +      * import math
>> +      * for i in range(1, 128):
>> +      *     print(int(math.log2(i)*16),end=', ')
>> +      */
>> +     uint8_t ret[128] = {
>
>         static const uint8_t ret
>
> Otherwise it consumes 128 bytes of the stack space.
>
>> +             0,    0,    16,   25,   32,   37,   41,   44,   48,   50,
>> +             53,   55,   57,   59,   60,   62,   64,   65,   66,   67,
>> +             69,   70,   71,   72,   73,   74,   75,   76,   76,   77,
>> +             78,   79,   80,   80,   81,   82,   82,   83,   83,   84,
>> +             85,   85,   86,   86,   87,   87,   88,   88,   89,   89,
>> +             90,   90,   91,   91,   92,   92,   92,   93,   93,   94,
>> +             94,   94,   95,   95,   96,   96,   96,   97,   97,   97,
>> +             98,   98,   98,   99,   99,   99,   99,   100,  100,  100,
>> +             101,  101,  101,  102,  102,  102,  102,  103,  103,  103,
>> +             103,  104,  104,  104,  104,  105,  105,  105,  105,  106,
>> +             106,  106,  106,  106,  107,  107,  107,  107,  108,  108,
>> +             108,  108,  108,  109,  109,  109,  109,  109,  110,  110,
>> +             110,  110,  110,  111,  111,  111,  111,  111
>> +
>> +     };
>> +
>> +
>> +     if (x < 128)
>> +             return ret[x];
>> +
>> +     if (x < 1024) {
>> +             if (x < 256) {
>> +             }
>> +             if (x < 470) {
>> +             }
>> +             if (x < 981) {
>> +             }
>> +             return 159;
>> +     }
>> +
>> +     if (x < 8193) {
>> +             if (x < 2048) {
>> +             }
>> +             if (x < 4096) {
>> +             }
>> +             if (x < 7845) {
>> +             }
>> +             return 207;
>> +     }
>> +     return 208;
>
> I hear the compiler scream, trying to optimize all the ifs. And I think
> the CPU branch prediction would have a hard time, there's practically no
> expected pattern and the performance could be worse compared to the
> formula calculation.
>
> What's the reason you opencode it like that? Could you please post how
> it would be implemented in C? There are potentially optimized functions
> to do integer log2 (ilog2).

That is just:
log2_lshift4(n) { return ilog2(pow(n, 16)); }
But that would cause an overflow: pow(n, 16) would need up to
13 * 16 = 208 bits for our input range.

With pow(n, 4) it fits:
log2(8192) -> 13, i.e. the input is at most 13 bits long,
and 13 * 4 = 52, 52 < 64.

That also restricts the sample size to 15 bits.

So we can get approximately the same behaviour
by doing ilog2(pow(n, 4));
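
For illustration, a user-space sketch of that approach could look like
this (the helper name is mine, it is not the patch code):

/*
 * Approximate log2(x) << 4 as ilog2(x^4) << 2:
 * log2(x^4) = 4 * log2(x), and x <= 8192 (13 bits), so x^4
 * fits into 52 bits of a u64 and does not overflow.
 */
static int log2_lshift4_approx(uint64_t x)
{
    int l = -1;

    if (x == 0)
        return 0;
    x = x * x * x * x;
    while (x) {        /* floor(log2(x^4)), like ilog2() */
        x >>= 1;
        l++;
    }
    return l << 2;
}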

https://pastebin.com/VryvqkCy - some numbers
TL;DR version:
pow(n, 4)
 - count       100       7   28300
 - error %    0.0%    1.1%    3.0%
pow(n, 16)
 - count     18843   11483      14
 - error %    0.0%    1.0%    2.8%

(I sampled a Win10.iso image as just a big file with varied data.)
I generate the error value by:
if (shannon_e_f > 0)
    error = 100 - (shannon_e_i * 1.0 * 100 / shannon_e_f);
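
For reference (my notation, not code from the patch), shannon_e_f is the
exact entropy and shannon_e_i the integer approximation, i.e. roughly:

  H_exact = -\sum_i \frac{c_i}{S} \log_2 \frac{c_i}{S}
          = \frac{1}{S} \sum_i c_i \left( \log_2 S - \log_2 c_i \right)

  H_int   = \frac{1}{16 S} \sum_i c_i \left( \mathrm{log2\_lshift4}(S)
          - \mathrm{log2\_lshift4}(c_i) \right)

where c_i are the bucket counts, S is the sample size, and both values
are then scaled to a percentage of the 8-bit maximum.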

In fact that causes errors in the entropy level like:
true value -> integer approximation
83 -> 80
87 -> 84
32 -> 28

In theory it is possible to just make the entropy level thresholds more
aggressive, like: 70 -> 65 and 85 -> 80.

That will cover the precision errors.

Or implement a byte-level BigInt in the kernel (bigint_log2, bigint_Lshift,
etc.) %)
(But that seems excessive to me, I'm unsure about the performance, and I'm
wary of big/little endian issues.)

> Code like that could be microbenchmarked so we can see actual numbers,
> so far I'm just guessing.

I mentioned it in the cover letter: https://github.com/Nefelim4ag/companalyze
(it implements the heuristic with no optimizations, all code paths run, to
get the raw counters).
Over in-memory badly compressible data:
 8188. BSize: 131072, RepPattern: 0, BSet: 256, BCSet: 220, ShanEi%:
99| 99 ~0.0%, RndDist:     0, out: -3

With -O0 - ~1100 MiB/s
With -O2 - ~2500 MiB/s

With this (based on what I found in the kernel, plus pow(n, 4)):
static int ilog2(uint64_t v)
{
    int l = 0;

    v = v * v * v * v;
    /* Note: rounds up for non-power-of-two values, unlike the kernel's ilog2() */
    while ((1ULL << l) < v)
        l++;
    return l;
}

With -O0 - ~1150 MiB/s
With -O2 - ~2500 MiB/s

So my solution looks worse in the code than in the performance numbers
(that may be because, for badly compressible data, most lookups go
directly to the precalculated table).
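
If a more isolated microbenchmark would help, a standalone sketch like
the one below (arbitrary call count and input distribution, nothing
from the kernel) can time the pow(n, 4) variant per call; the
log2_lshift4() lookup version from the patch can be dropped in for
comparison:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N_CALLS (1 << 24)

/* pow(n, 4) variant; swap in log2_lshift4() from the patch to compare */
static int ilog2_pow4(uint64_t v)
{
    int l = 0;

    v = v * v * v * v;
    while ((1ULL << l) < v)
        l++;
    return l;
}

static uint16_t input[N_CALLS];

int main(void)
{
    struct timespec a, b;
    uint64_t sum = 0;
    double ns;
    long i;

    /*
     * Random sample counts in the 1..8192 range, generated up front
     * so that rand() is not part of the measured loop.
     */
    srand(42);
    for (i = 0; i < N_CALLS; i++)
        input[i] = (rand() % 8192) + 1;

    clock_gettime(CLOCK_MONOTONIC, &a);
    for (i = 0; i < N_CALLS; i++)
        sum += ilog2_pow4(input[i]);
    clock_gettime(CLOCK_MONOTONIC, &b);

    ns = (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
    printf("sum: %llu, %.2f ns/call\n",
           (unsigned long long)sum, ns / N_CALLS);
    return 0;
}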

Thanks!

>> +}
>> +
>> +
>> +/*
>> + * Shannon Entropy calculation
>> + *
>> + * Pure byte distribution analysis fails to determine
>> + * compressibility of data. Try calculating entropy to
>> + * estimate the average minimum number of bits needed
>> + * to encode the sampled data.
>> + *
>> + * For convenience, return a percentage of the needed bits
>> + * instead of the bit amount directly.
>> + *
>> + * @ENTROPY_LVL_ACEPTABLE - below that threshold, the sample has low byte
>> + * entropy and can be compressed with high probability
>> + *
>> + * @ENTROPY_LVL_HIGH - data is not compressible with high probability
>> + */
>> +
>> +#define ENTROPY_LVL_ACEPTABLE 70
>> +#define ENTROPY_LVL_HIGH 85
>> +
>> +static uint32_t shannon_entropy(struct heuristic_ws *ws)
>> +{
>> +     const u32 entropy_max = 8*LOG2_RET_SHIFT;
>> +     const u32 q = log2_lshift4(ws->sample_size);
>> +     u32 p, i;
>> +     u64 entropy_sum;
>> +
>> +     entropy_sum = 0;
>> +     for (i = 0; i < BUCKET_SIZE && ws->bucket[i].count > 0; i++) {
>> +             p = ws->bucket[i].count;
>> +             entropy_sum += p * (q - log2_lshift4(p));
>> +     }
>> +
>> +     entropy_sum = div_u64(entropy_sum, ws->sample_size);
>> +     return div_u64(entropy_sum * 100, entropy_max);
>
> I wonder if some of the 64bit types and 64bit division could be avoided.



-- 
Have a nice day,
Timofey.