GitHub user zheniasigayev added a comment to the discussion: Best practices for memory-efficient deduplication of pre-sorted Parquet files
Addressing Question 1. The query plan for the original query:

```sql
CREATE EXTERNAL TABLE example (
    col_1 VARCHAR(50) NOT NULL,
    col_2 BIGINT NOT NULL,
    col_3 VARCHAR(50),
    col_4 VARCHAR(50),
    col_5 VARCHAR(50),
    col_6 VARCHAR(100) NOT NULL,
    col_7 VARCHAR(50),
    col_8 DOUBLE
)
WITH ORDER (col_1 ASC, col_2 ASC)
STORED AS PARQUET
LOCATION '/tmp/redacted/*.parquet';

EXPLAIN COPY (
    SELECT
        col_1, col_2, col_3, col_4, col_5, col_6,
        first_value(col_7) AS col_7,
        first_value(col_8) AS col_8
    FROM example
    GROUP BY col_1, col_2, col_3, col_4, col_5, col_6
    ORDER BY col_1 ASC, col_2 ASC
)
TO '/tmp/result.parquet'
STORED AS PARQUET
OPTIONS (compression 'zstd(1)');
```

The resulting `EXPLAIN` output (physical plan):

```
┌───────────────────────────┐
│ DataSinkExec              │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ SortPreservingMergeExec   │
│ --------------------      │
│ col_1 ASC NULLS LAST,     │
│ col_2 ASC NULLS LAST      │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ SortExec                  │
│ --------------------      │
│ col_1@0 ASC NULLS LAST,   │
│ col_2@1 ASC NULLS LAST    │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ ProjectionExec            │
│ --------------------      │
│ col_1: col_1              │
│ col_2: col_2              │
│ col_3: col_3              │
│ col_4: col_4              │
│ col_5: col_5              │
│ col_6: col_6              │
│                           │
│ col_7:                    │
│ first_value(example.col_7)│
│                           │
│ col_8:                    │
│ first_value(example.col_8)│
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ AggregateExec             │
│ --------------------      │
│ aggr:                     │
│ first_value(example.col_7)│
│ , first_value(example     │
│ .col_8)                   │
│                           │
│ group_by:                 │
│ col_1, col_2, col_3, col_4│
│ , col_5, col_6            │
│                           │
│ mode:                     │
│ FinalPartitioned          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ CoalesceBatchesExec       │
│ --------------------      │
│ target_batch_size:        │
│ 8192                      │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ RepartitionExec           │
│ --------------------      │
│ partition_count(in->out): │
│ 10 -> 10                  │
│                           │
│ partitioning_scheme:      │
│ Hash([col_1@0, col_2@1,   │
│ col_3@2, col_4@3,         │
│ col_5@4, col_6@5],        │
│ 10)                       │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ AggregateExec             │
│ --------------------      │
│ aggr:                     │
│ first_value(example.col_7)│
│ , first_value(example     │
│ .col_8)                   │
│                           │
│ group_by:                 │
│ col_1, col_2, col_3, col_4│
│ , col_5, col_6            │
│                           │
│ mode: Partial             │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ DataSourceExec            │
│ --------------------      │
│ files: 24                 │
│ format: parquet           │
└───────────────────────────┘
```

GitHub link: https://github.com/apache/datafusion/discussions/16776#discussioncomment-13780110

----

This is an automatically sent email for github@datafusion.apache.org. To unsubscribe, please send an email to: github-unsubscr...@datafusion.apache.org
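As a side note on the memory question behind the thread: because the files are already sorted on a prefix of the group-by key, a first-value deduplication can in principle run as a single streaming pass, holding only one group in memory at a time. A minimal pure-Python sketch of that idea (a hypothetical illustration only, not DataFusion code; the name `dedup_first` and the sample rows are made up):

```python
# Streaming first-value dedup over rows pre-sorted on the group key.
# Mirrors: GROUP BY col_1..col_6, first_value(col_7), first_value(col_8).
from itertools import groupby
from operator import itemgetter

def dedup_first(rows):
    """Yield the first row for each (col_1..col_6) key.

    Assumes `rows` is an iterable of 8-tuples already sorted on the
    first six fields, so each group arrives contiguously."""
    key = itemgetter(0, 1, 2, 3, 4, 5)
    for _, group in groupby(rows, key=key):
        # The first row of the group carries first_value(col_7/col_8).
        yield next(group)

rows = [
    ("a", 1, "x", "x", "x", "x", "keep", 1.0),
    ("a", 1, "x", "x", "x", "x", "drop", 2.0),
    ("b", 2, "y", "y", "y", "y", "keep", 3.0),
]
print(list(dedup_first(rows)))  # two rows survive, both with "keep"
```

`groupby` never materializes more than the current group, which is the property a sort-order-aware aggregation could exploit; the hash `RepartitionExec` + `AggregateExec` pipeline in the plan above suggests the order is not being used that way here.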