[GitHub] [spark] rednaxelafx edited a comment on issue #20965: [SPARK-21870][SQL] Split aggregation code into small functions

GitBox Wed, 28 Aug 2019 00:00:53 -0700

rednaxelafx edited a comment on issue #20965: [SPARK-21870][SQL] Split 
aggregation code into small functions
URL: https://github.com/apache/spark/pull/20965#issuecomment-525603727
 
 
   Thanks for working on this PR, @maropu ! The new generated code looks a lot 
better!
   I'll go into more detailed review soon. Three quick questions:
   1. although it probably won't change your example generated code, but for 
known not-null values extracted from locals/CSE state, when you're constructing 
the split-off functions, could you make sure that you don't pass redundant 
`isNull` argument into the split-off function? That could save a slot for each 
not-null value you pass down. It's sort of like how @viirya 's construct 
consume function works.
   2. Can we be more precise about which aggregate functions are referencing 
which extracted locals, and only pass relevant locals down into the split-off 
aggregate function? If you have a lot of different things, e.g. `sum(a), 
avg(a), sum(b), avg(b), sum(c), avg(c)`, then if we don't make it more precise, 
we'd be passing a long list of stuff down to each split-off function for no 
good.
   3. Can we only trigger the actual split after some certain threshold of the 
total generated code for aggregate functions is hit? i.e. when it's just a 
couple of short aggregate functions, generate your new code but don't split off 
into separate functions; when the total size of the generated code for 
aggregate function goes above some threshold, trigger your actual split. The 
conf for controlling whether or not to split can be put along side this 
threshold check, something like: `if (confEnabled && totalSize > threshold) 
splitAggregateBlahBlahBlah(generatedAggregateFuncCodes) else 
generatedAggregateFuncCodes.mkString("\n")`
   
   Another thing I'd like to point out with the updated PR description is that, 
under the "generated code in the current master" line, the example generated 
code doesn't look right: it looks like you're quoting the generated code from 
the final aggregate instead of the partial aggregate, because the inputs look 
like aggregate buffer slot values to me. You intended to quote the partial 
aggregate code, right?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] rednaxelafx edited a comment on issue #20965: [SPARK-21870][SQL] Split aggregation code into small functions

Reply via email to