GitHub user zheniasigayev added a comment to the discussion: Best practices for memory-efficient deduplication of pre-sorted Parquet files
Both queries use `mode=Partial`.

#### Addressing Question / Query 1)

```
logical_plan
CopyTo: format=parquet output_url=/tmp/result.parquet options: (format.compression zstd(1))
  Sort: example.col_1 ASC NULLS LAST, example.col_2 ASC NULLS LAST
    Projection: example.col_1, example.col_2, example.col_3, example.col_4, example.col_5, example.col_6, first_value(example.col_7) AS col_7, first_value(example.col_8) AS col_8
      Aggregate: groupBy=[[example.col_1, example.col_2, example.col_3, example.col_4, example.col_5, example.col_6]], aggr=[[first_value(example.col_7), first_value(example.col_8)]]
        TableScan: example projection=[col_1, col_2, col_3, col_4, col_5, col_6, col_7, col_8]

physical_plan
DataSinkExec: sink=ParquetSink(file_groups=[])
  SortPreservingMergeExec: [col_1@0 ASC NULLS LAST, col_2@1 ASC NULLS LAST]
    SortExec: expr=[col_1@0 ASC NULLS LAST, col_2@1 ASC NULLS LAST], preserve_partitioning=[true]
      ProjectionExec: expr=[col_1@0 as col_1, col_2@1 as col_2, col_3@2 as col_3, col_4@3 as col_4, col_5@4 as col_5, col_6@5 as col_6, first_value(example.col_7)@6 as col_7, first_value(example.col_8)@7 as col_8]
        AggregateExec: mode=FinalPartitioned, gby=[col_1@0 as col_1, col_2@1 as col_2, col_3@2 as col_3, col_4@3 as col_4, col_5@4 as col_5, col_6@5 as col_6], aggr=[first_value(example.col_7), first_value(example.col_8)]
          CoalesceBatchesExec: target_batch_size=8192
            RepartitionExec: partitioning=Hash([col_1@0, col_2@1, col_3@2, col_4@3, col_5@4, col_6@5], 10), input_partitions=10
              AggregateExec: mode=Partial, gby=[col_1@0 as col_1, col_2@1 as col_2, col_3@2 as col_3, col_4@3 as col_4, col_5@4 as col_5, col_6@5 as col_6], aggr=[first_value(example.col_7), first_value(example.col_8)]
                DataSourceExec: file_groups={10 groups: [[tmp/redacted/reproducible_data_0.parquet:0..12537303, tmp/redacted/reproducible_data_1.parquet:0..6245726], [tmp/redacted/reproducible_data_1.parquet:6245726..12518047, tmp/redacted/reproducible_data_10.parquet:0..12510708], [tmp/redacted/reproducible_data_10.parquet:12510708..12530931, tmp/redacted/reproducible_data_11.parquet:0..12518206, tmp/redacted/reproducible_data_12.parquet:0..6244600], [tmp/redacted/reproducible_data_12.parquet:6244600..12522074, tmp/redacted/reproducible_data_13.parquet:0..12505555], [tmp/redacted/reproducible_data_13.parquet:12505555..12523871, tmp/redacted/reproducible_data_14.parquet:0..12515635, tmp/redacted/reproducible_data_2.parquet:0..6249078], ...]}, projection=[col_1, col_2, col_3, col_4, col_5, col_6, col_7, col_8], file_type=parquet
```

#### Addressing Question / Query 2)

```
logical_plan
CopyTo: format=parquet output_url=/tmp/result_part2.parquet options: (format.compression zstd(1))
  Sort: example.col_1 ASC NULLS LAST, example.col_2 ASC NULLS LAST
    Aggregate: groupBy=[[example.col_1, example.col_2, example.col_3, example.col_4, example.col_5, example.col_6]], aggr=[[]]
      TableScan: example projection=[col_1, col_2, col_3, col_4, col_5, col_6]

physical_plan
DataSinkExec: sink=ParquetSink(file_groups=[])
  SortPreservingMergeExec: [col_1@0 ASC NULLS LAST, col_2@1 ASC NULLS LAST]
    SortExec: expr=[col_1@0 ASC NULLS LAST, col_2@1 ASC NULLS LAST], preserve_partitioning=[true]
      AggregateExec: mode=FinalPartitioned, gby=[col_1@0 as col_1, col_2@1 as col_2, col_3@2 as col_3, col_4@3 as col_4, col_5@4 as col_5, col_6@5 as col_6], aggr=[]
        CoalesceBatchesExec: target_batch_size=8192
          RepartitionExec: partitioning=Hash([col_1@0, col_2@1, col_3@2, col_4@3, col_5@4, col_6@5], 10), input_partitions=10
            AggregateExec: mode=Partial, gby=[col_1@0 as col_1, col_2@1 as col_2, col_3@2 as col_3, col_4@3 as col_4, col_5@4 as col_5, col_6@5 as col_6], aggr=[]
              DataSourceExec: file_groups={10 groups: [[tmp/redacted/reproducible_data_0.parquet:0..12537303, tmp/redacted/reproducible_data_1.parquet:0..6245726], [tmp/redacted/reproducible_data_1.parquet:6245726..12518047, tmp/redacted/reproducible_data_10.parquet:0..12510708], [tmp/redacted/reproducible_data_10.parquet:12510708..12530931, tmp/redacted/reproducible_data_11.parquet:0..12518206, tmp/redacted/reproducible_data_12.parquet:0..6244600], [tmp/redacted/reproducible_data_12.parquet:6244600..12522074, tmp/redacted/reproducible_data_13.parquet:0..12505555], [tmp/redacted/reproducible_data_13.parquet:12505555..12523871, tmp/redacted/reproducible_data_14.parquet:0..12515635, tmp/redacted/reproducible_data_2.parquet:0..6249078], ...]}, projection=[col_1, col_2, col_3, col_4, col_5, col_6], file_type=parquet
```

GitHub link: https://github.com/apache/datafusion/discussions/16776#discussioncomment-13882212
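As an aside on the question itself: the aggregation in both plans boils down to "keep the first row for each distinct key". Since the input files are already sorted on the key columns, that is in principle a single streaming pass with only one group buffered at a time. A minimal pure-Python sketch of that idea (not DataFusion code; the rows and the two-column key are made up for illustration):

```python
# Streaming dedup of pre-sorted rows: keep the first row per key.
# Because the input is sorted on the key columns, itertools.groupby
# yields each run of equal keys exactly once, so memory stays O(1)
# in the number of groups -- the property the question wants to exploit.
from itertools import groupby
from operator import itemgetter

# Hypothetical rows, pre-sorted on the key columns (col_1, col_2).
rows = [
    ("a", 1, "x"),
    ("a", 1, "y"),   # duplicate key -> dropped
    ("a", 2, "z"),
    ("b", 1, "w"),
]

key = itemgetter(0, 1)  # group key: (col_1, col_2)

# first_value per group == the first row of each run of equal keys.
deduped = [next(group) for _, group in groupby(rows, key=key)]
print(deduped)  # [('a', 1, 'x'), ('a', 2, 'z'), ('b', 1, 'w')]
```

Note this only works because the rows arrive sorted; on unsorted input the same `groupby` call would emit duplicate keys from separate runs.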