Re: PartitionBy and SortWithinPartitions

2022-06-03 Thread Nikhil Goyal
Hi Enrico, Thanks for replying. I want to partition by one column and then sort within those partitions by another column. DataFrameWriter has sortBy and bucketBy, but they require creating a new table (they work only with `saveAsTable`, not with plain `save`). I can write another job on top

Re: PartitionBy and SortWithinPartitions

2022-06-03 Thread Enrico Minack
Nikhil, What are you trying to achieve with this in the first place? What are your goals? What is the problem with your approach? Are you concerned about the 1000 files in each written col2-partition? write.partitionBy is something different from df.repartition or df.coalesce. The df

PartitionBy and SortWithinPartitions

2022-06-03 Thread Nikhil Goyal
Hi folks, We are trying to do `df.coalesce(1000).sortWithinPartitions("col1").write.mode('overwrite').partitionBy("col2").parquet(...)`. I do see that coalesce(1000) is applied for every sub-partition, but I wanted to know whether sortWithinPartitions("col1") is applied before or after partitionBy.
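
For readers landing on this thread, here is a minimal sketch of one pattern often suggested for this goal. It is not an answer confirmed in the messages above. The idea is to repartition by the partition column and then sort each in-memory partition by the partition column followed by the sort column, so the rows already arrive in the order in which the writer splits them into col2 directories. The helper name `write_sorted_partitions` and its parameters are hypothetical; `df` is assumed to be a PySpark DataFrame. Whether the col1 ordering survives the write may depend on the Spark version and settings, so it is worth checking the written files.

```python
def write_sorted_partitions(df, path, partition_col="col2", sort_col="col1"):
    """Write `df` as Parquet, one directory per `partition_col` value.

    Hypothetical helper: `df` is assumed to be a pyspark.sql.DataFrame.
    """
    (
        # Bring all rows with the same partition value into the same
        # in-memory partition, so each value is written by one task.
        df.repartition(partition_col)
        # Sort by the partition column first, then by the sort column.
        # The sort runs on the in-memory partitions, before the writer
        # splits rows into per-value directories.
        .sortWithinPartitions(partition_col, sort_col)
        .write.mode("overwrite")
        # Writer-level partitioning: controls the directory layout only.
        .partitionBy(partition_col)
        .parquet(path)
    )
```

Usage would look like `write_sorted_partitions(df, "/some/output/path")`, where the path is a placeholder. Compared with `coalesce(1000)`, this tends to produce far fewer files per col2 directory, which may or may not be what you want for very large partition values.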