Mark1626 commented on PR #21506:
URL: https://github.com/apache/datafusion/pull/21506#issuecomment-4372104467

   @Dandandan  Sorry I was busy with work and couldn't get back on this 
message. In the benchmarks and in general the flow SkippingAggregation doesn't 
seem to be picked up often. I use a custom Agg operator which dynamic switching 
to this flow in my production workload of about 1TB+ data.
   
   However, even without my custom agg operator, I think the defaults used for 
the switch are a bit too conservative and I see a huge difference changing 
them. I ran the following query with different defaults. My production query 
has about 50+ aggregations in the projection and multiple grouping columns, 
this is a simpler version of it on the Clickbench data
   
   ```
   SELECT
       Hits."UserID", Hits."RegionID", Hits."URLHash",
       SUM(Hits."IsRefresh"), SUM(Hits."ResolutionWidth"), 
SUM(Hits."ResolutionHeight"),
       SUM(Hits."ResolutionDepth"), SUM(Hits."AdvEngineID"), 
SUM(Hits."WindowClientWidth"),
       SUM(Hits."WindowClientHeight"), SUM(Hits."ClientTimeZone"), 
SUM(Hits."Age"),
       SUM(Hits."Sex"), SUM(Hits."Income"), SUM(Hits."Interests"),
       SUM(Hits."Robotness"), SUM(Hits."HistoryLength"), SUM(Hits."HTTPError"),
       SUM(Hits."SendTiming"), SUM(Hits."DNSTiming"), SUM(Hits."ConnectTiming"),
       SUM(Hits."ResponseStartTiming"), SUM(Hits."ResponseEndTiming"), 
SUM(Hits."FetchTiming"),
       SUM(Hits."SocialSourceNetworkID"), SUM(Hits."ParamPrice"), 
SUM(Hits."ParamCurrencyID"),
       SUM(Hits."HasGCLID"), SUM(Hits."CLID"), SUM(Hits."GoodEvent"),
       SUM(Hits."CounterClass"), SUM(Hits."OS"), SUM(Hits."JavaEnable")
   FROM hits
   GROUP BY Hits."UserID", Hits."RegionID", Hits."URLHash"
   ORDER BY Hits."UserID" LIMIT 1;
   ```
   
   Machine used - Apple M4 Pro, 48GB Memory
   
   ## Without the changes
   
   ```
   DataFusion CLI v53.0.0
   > CREATE EXTERNAL TABLE hits STORED AS PARQUET LOCATION 
'/Users/nimalanm/app/e6-parquet-benchmarks/schema/hits';
   0 row(s) fetched.
   Elapsed 0.049 seconds.
   
   
   > SELECT
       Hits."UserID", Hits."RegionID", Hits."URLHash",
       SUM(Hits."IsRefresh"), SUM(Hits."ResolutionWidth"), 
SUM(Hits."ResolutionHeight"),
       SUM(Hits."ResolutionDepth"), SUM(Hits."AdvEngineID"), 
SUM(Hits."WindowClientWidth"),
       SUM(Hits."WindowClientHeight"), SUM(Hits."ClientTimeZone"), 
SUM(Hits."Age"),
       SUM(Hits."Sex"), SUM(Hits."Income"), SUM(Hits."Interests"),
       SUM(Hits."Robotness"), SUM(Hits."HistoryLength"), SUM(Hits."HTTPError"),
       SUM(Hits."SendTiming"), SUM(Hits."DNSTiming"), SUM(Hits."ConnectTiming"),
       SUM(Hits."ResponseStartTiming"), SUM(Hits."ResponseEndTiming"), 
SUM(Hits."FetchTiming"),
       SUM(Hits."SocialSourceNetworkID"), SUM(Hits."ParamPrice"), 
SUM(Hits."ParamCurrencyID"),
       SUM(Hits."HasGCLID"), SUM(Hits."CLID"), SUM(Hits."GoodEvent"),
       SUM(Hits."CounterClass"), SUM(Hits."OS"), SUM(Hits."JavaEnable")
   FROM hits
   GROUP BY Hits."UserID", Hits."RegionID", Hits."URLHash"
   ORDER BY Hits."UserID" LIMIT 1;
   
   ....
   ....
   ....
   ....
   
   1 row(s) fetched.
   Elapsed 10.471 seconds.
   
   > SET datafusion.execution.skip_partial_aggregation_probe_rows_threshold = 
1000;
   0 row(s) fetched.
   Elapsed 0.000 seconds.
   
   > SET datafusion.execution.skip_partial_aggregation_probe_ratio_threshold = 
0.0;
   0 row(s) fetched.
   Elapsed 0.002 seconds.
   
   > SELECT
       Hits."UserID", Hits."RegionID", Hits."URLHash",
       SUM(Hits."IsRefresh"), SUM(Hits."ResolutionWidth"), 
SUM(Hits."ResolutionHeight"),
       SUM(Hits."ResolutionDepth"), SUM(Hits."AdvEngineID"), 
SUM(Hits."WindowClientWidth"),
       SUM(Hits."WindowClientHeight"), SUM(Hits."ClientTimeZone"), 
SUM(Hits."Age"),
       SUM(Hits."Sex"), SUM(Hits."Income"), SUM(Hits."Interests"),
       SUM(Hits."Robotness"), SUM(Hits."HistoryLength"), SUM(Hits."HTTPError"),
       SUM(Hits."SendTiming"), SUM(Hits."DNSTiming"), SUM(Hits."ConnectTiming"),
       SUM(Hits."ResponseStartTiming"), SUM(Hits."ResponseEndTiming"), 
SUM(Hits."FetchTiming"),
       SUM(Hits."SocialSourceNetworkID"), SUM(Hits."ParamPrice"), 
SUM(Hits."ParamCurrencyID"),
       SUM(Hits."HasGCLID"), SUM(Hits."CLID"), SUM(Hits."GoodEvent"),
       SUM(Hits."CounterClass"), SUM(Hits."OS"), SUM(Hits."JavaEnable")
   FROM hits
   GROUP BY Hits."UserID", Hits."RegionID", Hits."URLHash"
   ORDER BY Hits."UserID" LIMIT 1;
   
   ....
   ....
   ....
   ....
   
   1 row(s) fetched.
   Elapsed 6.033 seconds.
   ```
   
   ---
   
   ## With the changes
   
   ```
   > SELECT
       Hits."UserID", Hits."RegionID", Hits."URLHash",
       SUM(Hits."IsRefresh"), SUM(Hits."ResolutionWidth"), 
SUM(Hits."ResolutionHeight"),
       SUM(Hits."ResolutionDepth"), SUM(Hits."AdvEngineID"), 
SUM(Hits."WindowClientWidth"),
       SUM(Hits."WindowClientHeight"), SUM(Hits."ClientTimeZone"), 
SUM(Hits."Age"),
       SUM(Hits."Sex"), SUM(Hits."Income"), SUM(Hits."Interests"),
       SUM(Hits."Robotness"), SUM(Hits."HistoryLength"), SUM(Hits."HTTPError"),
       SUM(Hits."SendTiming"), SUM(Hits."DNSTiming"), SUM(Hits."ConnectTiming"),
       SUM(Hits."ResponseStartTiming"), SUM(Hits."ResponseEndTiming"), 
SUM(Hits."FetchTiming"),
       SUM(Hits."SocialSourceNetworkID"), SUM(Hits."ParamPrice"), 
SUM(Hits."ParamCurrencyID"),
       SUM(Hits."HasGCLID"), SUM(Hits."CLID"), SUM(Hits."GoodEvent"),
       SUM(Hits."CounterClass"), SUM(Hits."OS"), SUM(Hits."JavaEnable")
   FROM hits
   GROUP BY Hits."UserID", Hits."RegionID", Hits."URLHash"
   ORDER BY Hits."UserID" LIMIT 1;
   
   ....
   ....
   ....
   ....
   
   1 row(s) fetched.
   Elapsed 5.644 seconds.
   ```
   
   I'm not able to exactly prove the difference as the scale of data is 
different, but at a larger scale of data the allocation and the arrow kernel 
were visible in my flamegraph. Any thoughts?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to