Re: [PR] [GLUTEN-4889][VL] Support approx_percentile [incubator-gluten]
WangGuangxin commented on code in PR #5007: URL: https://github.com/apache/incubator-gluten/pull/5007#discussion_r1593888422 ## gluten-core/src/main/scala/io/glutenproject/extension/columnar/RewriteTypedImperativeAggregate.scala: ## @@ -67,6 +100,28 @@ object RewriteTypedImperativeAggregate extends Rule[SparkPlan] with PullOutProje } copyBaseAggregateExec(agg)(newResultExpressions = newResultExpressions) + case agg: BaseAggregateExec + if agg.aggregateExpressions.exists(shouldRewriteForPercentileLikeExpr) => +val exprMap = agg.aggregateExpressions + .filter(shouldRewriteForPercentileLikeExpr) + .map(ae => ae.aggregateFunction.inputAggBufferAttributes.head -> ae) + .toMap +val newResultExpressions = agg.resultExpressions.map { + case attr: AttributeReference => +exprMap + .get(attr) + .map { +ae => + attr.copy(dataType = getPercentileLikeInterminateDataType(ae.aggregateFunction))( +exprId = attr.exprId, +qualifier = attr.qualifier + ) + } + .getOrElse(attr) + case other => other +} +copyBaseAggregateExec(agg)(newResultExpressions = newResultExpressions) Review Comment: ok, I'll rebase and refact it -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For additional commands, e-mail: commits-h...@gluten.apache.org
Re: [PR] [GLUTEN-4889][VL] Support approx_percentile [incubator-gluten]
zhztheplayer commented on code in PR #5007: URL: https://github.com/apache/incubator-gluten/pull/5007#discussion_r1591954924 ## gluten-core/src/main/scala/io/glutenproject/extension/columnar/RewriteTypedImperativeAggregate.scala: ## @@ -67,6 +100,28 @@ object RewriteTypedImperativeAggregate extends Rule[SparkPlan] with PullOutProje } copyBaseAggregateExec(agg)(newResultExpressions = newResultExpressions) + case agg: BaseAggregateExec + if agg.aggregateExpressions.exists(shouldRewriteForPercentileLikeExpr) => +val exprMap = agg.aggregateExpressions + .filter(shouldRewriteForPercentileLikeExpr) + .map(ae => ae.aggregateFunction.inputAggBufferAttributes.head -> ae) + .toMap +val newResultExpressions = agg.resultExpressions.map { + case attr: AttributeReference => +exprMap + .get(attr) + .map { +ae => + attr.copy(dataType = getPercentileLikeInterminateDataType(ae.aggregateFunction))( +exprId = attr.exprId, +qualifier = attr.qualifier + ) + } + .getOrElse(attr) + case other => other +} +copyBaseAggregateExec(agg)(newResultExpressions = newResultExpressions) Review Comment: IIUC this could make the offloaded partial aggregate operator incompatible with vanilla Spark's final aggregate operator? Say if final aggregation is fallen back then UB will be led. I am considering whether we could use the approach like we had done for `approx_count_distinct` (https://github.com/apache/incubator-gluten/pull/1676), to replace `approx_percentile` with something like `velox_approx_percentile` that has different intermediate type definition (or use an internal project to match up with vanilla's intermediate type) at the very early planning phase? Then we can distinguish between vanilla's and Velox's implementations for this function. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For additional commands, e-mail: commits-h...@gluten.apache.org
Re: [PR] [GLUTEN-4889][VL] Support approx_percentile [incubator-gluten]
zhztheplayer commented on code in PR #5007: URL: https://github.com/apache/incubator-gluten/pull/5007#discussion_r1591954924 ## gluten-core/src/main/scala/io/glutenproject/extension/columnar/RewriteTypedImperativeAggregate.scala: ## @@ -67,6 +100,28 @@ object RewriteTypedImperativeAggregate extends Rule[SparkPlan] with PullOutProje } copyBaseAggregateExec(agg)(newResultExpressions = newResultExpressions) + case agg: BaseAggregateExec + if agg.aggregateExpressions.exists(shouldRewriteForPercentileLikeExpr) => +val exprMap = agg.aggregateExpressions + .filter(shouldRewriteForPercentileLikeExpr) + .map(ae => ae.aggregateFunction.inputAggBufferAttributes.head -> ae) + .toMap +val newResultExpressions = agg.resultExpressions.map { + case attr: AttributeReference => +exprMap + .get(attr) + .map { +ae => + attr.copy(dataType = getPercentileLikeInterminateDataType(ae.aggregateFunction))( +exprId = attr.exprId, +qualifier = attr.qualifier + ) + } + .getOrElse(attr) + case other => other +} +copyBaseAggregateExec(agg)(newResultExpressions = newResultExpressions) Review Comment: IIUC this could make the offloaded partial aggregate operator incompatible with vanilla Spark's final aggregate operator? Say if final aggregation is fallen back then UB will be led. I am considering whether we could use the approach like we had done for `approx_count_distinct`, to replace `approx_percentile` with something like `velox_approx_percentile` that has different intermediate type definition at the very early planning phase? Then we can distinguish between vanilla's and Velox's implementations for this function. ## gluten-core/src/main/scala/io/glutenproject/extension/columnar/RewriteTypedImperativeAggregate.scala: ## @@ -67,6 +100,28 @@ object RewriteTypedImperativeAggregate extends Rule[SparkPlan] with PullOutProje } copyBaseAggregateExec(agg)(newResultExpressions = newResultExpressions) + case agg: BaseAggregateExec + if agg.aggregateExpressions.exists(shouldRewriteForPercentileLikeExpr) => +val exprMap = agg.aggregateExpressions + .filter(shouldRewriteForPercentileLikeExpr) + .map(ae => ae.aggregateFunction.inputAggBufferAttributes.head -> ae) + .toMap +val newResultExpressions = agg.resultExpressions.map { + case attr: AttributeReference => +exprMap + .get(attr) + .map { +ae => + attr.copy(dataType = getPercentileLikeInterminateDataType(ae.aggregateFunction))( +exprId = attr.exprId, +qualifier = attr.qualifier + ) + } + .getOrElse(attr) + case other => other +} +copyBaseAggregateExec(agg)(newResultExpressions = newResultExpressions) Review Comment: IIUC this could make the offloaded partial aggregate operator incompatible with vanilla Spark's final aggregate operator? Say if final aggregation is fallen back then UB will be led. I am considering whether we could use the approach like we had done for `approx_count_distinct` (https://github.com/apache/incubator-gluten/pull/1676), to replace `approx_percentile` with something like `velox_approx_percentile` that has different intermediate type definition at the very early planning phase? Then we can distinguish between vanilla's and Velox's implementations for this function. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For additional commands, e-mail: commits-h...@gluten.apache.org
Re: [PR] [GLUTEN-4889][VL] Support approx_percentile [incubator-gluten]
zhztheplayer commented on code in PR #5007: URL: https://github.com/apache/incubator-gluten/pull/5007#discussion_r1591954924 ## gluten-core/src/main/scala/io/glutenproject/extension/columnar/RewriteTypedImperativeAggregate.scala: ## @@ -67,6 +100,28 @@ object RewriteTypedImperativeAggregate extends Rule[SparkPlan] with PullOutProje } copyBaseAggregateExec(agg)(newResultExpressions = newResultExpressions) + case agg: BaseAggregateExec + if agg.aggregateExpressions.exists(shouldRewriteForPercentileLikeExpr) => +val exprMap = agg.aggregateExpressions + .filter(shouldRewriteForPercentileLikeExpr) + .map(ae => ae.aggregateFunction.inputAggBufferAttributes.head -> ae) + .toMap +val newResultExpressions = agg.resultExpressions.map { + case attr: AttributeReference => +exprMap + .get(attr) + .map { +ae => + attr.copy(dataType = getPercentileLikeInterminateDataType(ae.aggregateFunction))( +exprId = attr.exprId, +qualifier = attr.qualifier + ) + } + .getOrElse(attr) + case other => other +} +copyBaseAggregateExec(agg)(newResultExpressions = newResultExpressions) Review Comment: IIUC this could make the offloaded partial aggregate operator incompatible with vanilla Spark's final aggregate operator? Say if final aggregation is fallen back then UB will be led. I am considering whether we could using the approach like we had done for `approx_count_distinct`, to replace `approx_percentile` with something like `velox_approx_percentile` that has different intermediate type definition at the very early planning phase? Then we can distinguish between vanilla's and Velox's implementations for this function. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For additional commands, e-mail: commits-h...@gluten.apache.org
Re: [PR] [GLUTEN-4889][VL] Support approx_percentile [incubator-gluten]
liujiayi771 commented on PR #5007: URL: https://github.com/apache/incubator-gluten/pull/5007#issuecomment-2076680006 @WangGuangxin Any progress of this PR? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For additional commands, e-mail: commits-h...@gluten.apache.org
Re: [PR] [GLUTEN-4889][VL] Support approx_percentile [incubator-gluten]
WangGuangxin commented on code in PR #5007: URL: https://github.com/apache/incubator-gluten/pull/5007#discussion_r1538532110 ## gluten-core/src/main/scala/io/glutenproject/extension/columnar/RewriteTypedImperativeAggregate.scala: ## @@ -40,6 +41,38 @@ object RewriteTypedImperativeAggregate extends Rule[SparkPlan] with PullOutProje } } + def shouldRewriteForPercentileLikeExpr(ae: AggregateExpression): Boolean = { +ae.aggregateFunction match { + case _: ApproximatePercentile => +ae.mode match { + case Partial | PartialMerge => true Review Comment: yeah, it should be removed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For additional commands, e-mail: commits-h...@gluten.apache.org
Re: [PR] [GLUTEN-4889][VL] Support approx_percentile [incubator-gluten]
liujiayi771 commented on code in PR #5007: URL: https://github.com/apache/incubator-gluten/pull/5007#discussion_r1536728309 ## gluten-core/src/main/scala/io/glutenproject/extension/columnar/RewriteTypedImperativeAggregate.scala: ## @@ -40,6 +41,38 @@ object RewriteTypedImperativeAggregate extends Rule[SparkPlan] with PullOutProje } } + def shouldRewriteForPercentileLikeExpr(ae: AggregateExpression): Boolean = { +ae.aggregateFunction match { + case _: ApproximatePercentile => +ae.mode match { + case Partial | PartialMerge => true Review Comment: It only support `Partial` and `Final` in `HashAggregateExecTransformer.scala`. Shall we add `PartialMerge` here? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For additional commands, e-mail: commits-h...@gluten.apache.org
Re: [PR] [GLUTEN-4889][VL] Support approx_percentile [incubator-gluten]
WangGuangxin commented on PR #5007: URL: https://github.com/apache/incubator-gluten/pull/5007#issuecomment-2011045153 > this yes, we should merge to upsteam first before merge this PR -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For additional commands, e-mail: commits-h...@gluten.apache.org
Re: [PR] [GLUTEN-4889][VL] Support approx_percentile [incubator-gluten]
zhztheplayer commented on PR #5007: URL: https://github.com/apache/incubator-gluten/pull/5007#issuecomment-2010990615 Hi @WangGuangxin so the current plan is to merge https://github.com/WangGuangxin/gluten/commit/97db869a56e8d42c0782e4e8218e1937e40682ce to upstream Velox before this ? Am I understanding correctly? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For additional commands, e-mail: commits-h...@gluten.apache.org
Re: [PR] [GLUTEN-4889][VL] Support approx_percentile [incubator-gluten]
liujiayi771 commented on PR #5007: URL: https://github.com/apache/incubator-gluten/pull/5007#issuecomment-2008652742 @WangGuangxin Perhaps we could create an issue in the upstream Velox community to remove this verification. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For additional commands, e-mail: commits-h...@gluten.apache.org
Re: [PR] [GLUTEN-4889][VL] Support approx_percentile [incubator-gluten]
zhztheplayer commented on code in PR #5007: URL: https://github.com/apache/incubator-gluten/pull/5007#discussion_r1531488552 ## backends-velox/src/test/scala/io/glutenproject/execution/TestOperator.scala: ## @@ -1233,4 +1233,12 @@ class TestOperator extends VeloxWholeStageTransformerSuite { checkOperatorMatch[HashAggregateExecTransformer] } } + + test("Support ApproximatePercentile") { +runQueryAndCompare(""" + |SELECT approx_percentile(col, array(0.5, 0.4, 0.1), 100) Review Comment: I remember Spark has `percentile_approx`. Is that one supported by this patch either? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For additional commands, e-mail: commits-h...@gluten.apache.org
Re: [PR] [GLUTEN-4889][VL] Support approx_percentile [incubator-gluten]
zhztheplayer commented on PR #5007: URL: https://github.com/apache/incubator-gluten/pull/5007#issuecomment-2008617781 Thank you in advance! > I believe the change on Gluten side is ready now. Did that mean this PR can be merged prior to merging Velox changes? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For additional commands, e-mail: commits-h...@gluten.apache.org
Re: [PR] [GLUTEN-4889][VL] Support approx_percentile [incubator-gluten]
WangGuangxin commented on PR #5007: URL: https://github.com/apache/incubator-gluten/pull/5007#issuecomment-2008557794 I believe the change on Gluten side is ready now. But the `ApproxPercentileAggregate` implements in velox has to make some minor modification, becasue when add intermediate data, it checks the row vector encoding in `addIntermediateImpl` https://github.com/facebookincubator/velox/blob/main/velox/functions/prestosql/aggregates/ApproxPercentileAggregate.cpp#L652 , I believe the encoding is guaranteed by `extractAccumulators` https://github.com/facebookincubator/velox/blob/main/velox/functions/prestosql/aggregates/ApproxPercentileAggregate.cpp#L262. But when it comes in Gluten, the output of PartialAgg are processed by Shuffle, which will flatten all vectors, so that the encoding is not guaranteed when added to FinalAgg. https://github.com/apache/incubator-gluten/blob/main/cpp/velox/shuffle/VeloxShuffleWriter.cc#L343 cc @zhztheplayer @liujiayi771 @ulysses-you -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For additional commands, e-mail: commits-h...@gluten.apache.org
Re: [PR] [GLUTEN-4889][VL] Support approx_percentile [incubator-gluten]
zhouyuan commented on PR #5007: URL: https://github.com/apache/incubator-gluten/pull/5007#issuecomment-2008478357 CC: @zhztheplayer -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For additional commands, e-mail: commits-h...@gluten.apache.org
Re: [PR] [GLUTEN-4889][VL] Support approx_percentile [incubator-gluten]
github-actions[bot] commented on PR #5007: URL: https://github.com/apache/incubator-gluten/pull/5007#issuecomment-2005348332 Run Gluten Clickhouse CI -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For additional commands, e-mail: commits-h...@gluten.apache.org
Re: [PR] [GLUTEN-4889][VL] Support approx_percentile [incubator-gluten]
github-actions[bot] commented on PR #5007: URL: https://github.com/apache/incubator-gluten/pull/5007#issuecomment-2005347612 https://github.com/apache/incubator-gluten/issues/4889 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For additional commands, e-mail: commits-h...@gluten.apache.org