Matthew Jacobs has posted comments on this change.

Change subject: IMPALA-4787: Optimize APPX_MEDIAN() memory usage
......................................................................


Patch Set 2:

(17 comments)

this probably needs some new test cases, especially cases where there are 
multiple pre-aggs where, e.g. one has very few rows and another has 20k or more 
rows. I think this may break the merge. This might be doable by having a table 
with files of significantly varying sizes.

http://gerrit.cloudera.org:8080/#/c/6025/1/be/src/exprs/aggregate-functions-ir.cc
File be/src/exprs/aggregate-functions-ir.cc:

Line 974: 
> Done
it'd be good to add some bounds dchecks though


http://gerrit.cloudera.org:8080/#/c/6025/2/be/src/exprs/aggregate-functions-ir.cc
File be/src/exprs/aggregate-functions-ir.cc:

PS2, Line 963: max_num_samples
confusing to have MAX_NUM_SAMPLES as well. maybe this should be capacity


PS2, Line 973: { return begin() + idx; }
it'd be good to add bounds checks w/ dchecks


PS2, Line 984: max_num_samples
capacity


Line 996:   bool IsFull() const {
dcheck that num_samples <= capacity, i.e. that it is NOT over


PS2, Line 990:   size_t byte_size() const {
             :     return byte_size(max_num_samples);
             :   }
             : 
             :   // Returns true if the array of ReservoirSamples that follows 
this struct in memory is
             :   // full, and no more elements can be pushed back without 
resizing.
             :   bool IsFull() const {
             :     return num_samples == max_num_samples;
             :   }
i think both of these can be one line


PS2, Line 1001: // make this const
?


PS2, Line 1018: resizeReservoirSampleState
nit: inconsistent casing in fn names


PS2, Line 1021: N_
NUM


PS2, Line 1024: new_size 
new_capacity to avoid conflating with memory sizes


PS2, Line 1024:   int new_size = state->max_num_samples * 10;
              :   DCHECK_EQ(state->max_num_samples, state->num_samples);
              :   DCHECK_LE(new_size, MAX_NUM_SAMPLES);
Looks like it'll dcheck when capacity reaches MAX_NUM_SAMPLES. Shouldn't this 
be taking the min() or something like that?


PS2, Line 1036:   state->max_num_samples = new_size;
why don't any other fields need to be initialized? e.g. the rng will be garbage.

I also think you need to copy over the prev 'source_size'.


PS2, Line 1048: size
again I'd vote for capacity


PS2, Line 1058:   *state = ReservoirSampleState<T>();
this looks like it'll initialize the memory properly. I think we'd probably 
need to do this at l1035


PS2, Line 1084:   ++state->source_size;
I think this gets lost when you allocate a new State.


PS2, Line 1152: ReservoirSampleMerge
I think this algorithm will require some changes to handle varying length 
States.


PS2, Line 1167: (dst->num_samples < MAX_NUM_SAMPLES
I don't think you can rely on this anymore, dst may not have capacity 
MAX_NUM_SAMPLES


-- 
To view, visit http://gerrit.cloudera.org:8080/6025
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I99adaad574d4fb0a3cf38c6cbad8b2a23df12968
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky <tbobrovyt...@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.b...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <m...@cloudera.com>
Gerrit-Reviewer: Taras Bobrovytsky <tbobrovyt...@cloudera.com>
Gerrit-HasComments: Yes

Reply via email to