Matthew Jacobs has posted comments on this change. Change subject: IMPALA-4787: Optimize APPX_MEDIAN() memory usage ......................................................................
Patch Set 2: (17 comments) this probably needs some new test cases, especially cases where there are multiple pre-aggs where, e.g. one has very few rows and another has 20k or more rows. I think this may break the merge. This might be doable by having a table with files of significantly varying sizes. http://gerrit.cloudera.org:8080/#/c/6025/1/be/src/exprs/aggregate-functions-ir.cc File be/src/exprs/aggregate-functions-ir.cc: Line 974: > Done it'd be good to add some bounds dchecks though http://gerrit.cloudera.org:8080/#/c/6025/2/be/src/exprs/aggregate-functions-ir.cc File be/src/exprs/aggregate-functions-ir.cc: PS2, Line 963: max_num_samples confusing to have MAX_NUM_SAMPLES as well. maybe this should be capacity PS2, Line 973: { return begin() + idx; } it'd be good to add bounds checks w/ dchecks PS2, Line 984: max_num_samples capacity Line 996: bool IsFull() const { dcheck that num_samples <= capacity, i.e. that it is NOT over PS2, Line 990: size_t byte_size() const { : return byte_size(max_num_samples); : } : : // Returns true if the array of ReservoirSamples that follows this struct in memory is : // full, and no more elements can be pushed back without resizing. : bool IsFull() const { : return num_samples == max_num_samples; : } i think both of these can be one line PS2, Line 1001: // make this const ? PS2, Line 1018: resizeReservoirSampleState nit: inconsistent casing in fn names PS2, Line 1021: N_ NUM PS2, Line 1024: new_size new_capacity to avoid conflating with memory sizes PS2, Line 1024: int new_size = state->max_num_samples * 10; : DCHECK_EQ(state->max_num_samples, state->num_samples); : DCHECK_LE(new_size, MAX_NUM_SAMPLES); Looks like it'll dcheck when capacity reaches MAX_NUM_SAMPLES. Shouldn't this be taking the min() or something like that? PS2, Line 1036: state->max_num_samples = new_size; why don't any other fields need to be initialized? e.g. the rng will be garbage. I also think you need to copy over the prev 'source_size'. PS2, Line 1048: size again I'd vote for capacity PS2, Line 1058: *state = ReservoirSampleState<T>(); this looks like it'll initialize the memory properly. I think we'd probably need to do this at l1035 PS2, Line 1084: ++state->source_size; I think this gets lost when you allocate a new State. PS2, Line 1152: ReservoirSampleMerge I think this algorithm will require some changes to handle varying length States. PS2, Line 1167: (dst->num_samples < MAX_NUM_SAMPLES I don't think you can rely on this anymore, dst may not have capacity MAX_NUM_SAMPLES -- To view, visit http://gerrit.cloudera.org:8080/6025 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I99adaad574d4fb0a3cf38c6cbad8b2a23df12968 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras Bobrovytsky <tbobrovyt...@cloudera.com> Gerrit-Reviewer: Alex Behm <alex.b...@cloudera.com> Gerrit-Reviewer: Matthew Jacobs <m...@cloudera.com> Gerrit-Reviewer: Taras Bobrovytsky <tbobrovyt...@cloudera.com> Gerrit-HasComments: Yes