[GitHub] [systemds] BACtaki opened a new pull request, #1727: [SYSTEMDS-3458] Adding Spark backend support for countDistinct() builtin

GitBox Sat, 12 Nov 2022 18:20:45 -0800


BACtaki opened a new pull request, #1727:
URL: https://github.com/apache/systemds/pull/1727


   ## Description
   This patch adds support for running countDistinct() builtin on the Spark 
backend. The implementation adds a new CountDistinctFunctionSketch to the 
sketches library. The sketch is based on an optimized HashMap lookup:
       Let [sign (1) | exponent (11) | fraction (52)] represent the bits of a 
double value
   
       - Each double is split into 2 parts:
           - PartA: [sign (1) | exponent (11)] and
           - PartB: [fraction (52)]
       - PartA is used as the key into the HashMap, and PartB is stored in the 
Set<Long> value
       - Many values in blkIn may have the same exponent, so we need to store 
multiple PartB values for any given PartA key
       - The HashMap is then serialized into the sketch like so:
           row_i: [exponent_i, N_i, fraction_i0, fraction_i1, .., fraction_iN, 
0, .., 0]
   
           - exponent_i: it is the short int value of the bit representation of 
the original double input
           - N_i: we store the number of fractions per exponent value to avoid 
dealing with jagged matrices
       
   NB: Although countDistinct() does not strictly require us to keep track of 
the individual input elements, we must do so in the sketch so we can support 
the union() op for the MULTI_BLOCK case (see notes below)
   
   In the worst case, blkIn can have 1000x1000 = 10e6 unique values. Therefore, 
total memory is bounded above by:
   ```
       Max of
           (10e6 * 2) + (10e6 * 8) bytes   -> 10e6 keys, each with a single 
value
               and 2 + (10e6 * 8) bytes    -> a single key, with unique 10e6 
fraction parts
       = (10e6 * 2) + (10e6 * 8) = 10e6 * 10 bytes < 10 MB
   ```
   
   ## Notes to Reviewers
       - The current implementation only supports the SINGLE_BLOCK case, i.e. 
the current implementation works for input matrices of max size 1000 x 1000. 
Larger inputs (along either dimension) will throw a `NotImplementedException`. 
MULTI_BLOCK aggregation support will be added in a future patch.
       - This patch adds support for only the default RowCol direction; we will 
add support for Row/Col direction in a future patch.
   
   ## Tests
       [X] Unit tests
       [ ] Integration tests
       [ ] N/A
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [systemds] BACtaki opened a new pull request, #1727: [SYSTEMDS-3458] Adding Spark backend support for countDistinct() builtin

Reply via email to