Hi, I am running the Spark DataFrame NTILE window function over a large dataset. It spills a lot of data while sorting and eventually fails.
The data is roughly 80 million records, about 4 GB in size (I am not sure whether that is serialized or deserialized). I am computing NTILE(10) over all of these records, ordered by one metric. A few stats are below. I need help finding alternatives, or hearing from anyone who has benchmarked the highest load this function can handle.

The snapshot below shows the NTILE calculation for two columns separately. Each one runs, and the complete data ends up in that final single partition: the window function moves everything into one partition to compute NTILE, which is 80 million records in my case.

[image: Screen Shot 2019-06-13 at 12.51.56 AM.jpg]

Executor memory is 8 GB, with shuffle.storageMemory at 0.8, so about 5.5 GB is available. The ~80 million records show up in the stage-level metrics under *Shuffle read size / records*, as in the snapshot below.

[image: Screen Shot 2019-06-13 at 12.51.33 AM.jpg]

Is there any alternative, or is it not feasible to perform this operation with Spark SQL window functions?

Thanks,
Subash
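P.S. For reference, here is a minimal sketch of the pattern I am describing above. The DataFrame, column names, and paths are illustrative placeholders, not my actual job:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.ntile

val spark = SparkSession.builder.appName("ntile-over-80M").getOrCreate()

// Illustrative input: ~80M rows with one numeric "metric" column.
val df = spark.read.parquet("/path/to/input")

// A global window ordered by the metric, with no partitionBy.
// Because there is no PARTITION BY, Spark moves every row into a single
// partition/task to assign the tile numbers, which is where the sort
// spills and the job eventually fails.
val w = Window.orderBy("metric")

val tiled = df.withColumn("decile", ntile(10).over(w))

tiled.write.parquet("/path/to/output")
```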