I am profiling TPC-H queries on Spark 2.0. I see a lot of temporary object creation (sometimes as large as the data itself), which is justified for the kind of processing Spark does. But from a production perspective, is there a guideline on how much memory should be allocated for processing a given size of, say, Parquet data? Also, has anyone investigated the memory usage of individual SQL operators like Filter, GroupBy, OrderBy, Exchange, etc.?
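For reference, here is a minimal sketch of the kind of sizing run I am doing (the memory settings, path, and query below are placeholders from my setup, not a recommendation): fix the executor memory, run a representative TPC-H-style aggregation over the Parquet data, and then read whatever per-operator metrics the operators report (e.g. peak memory, spill size) off the SQL tab of the Spark UI.

```scala
import org.apache.spark.sql.SparkSession

object TpchMemoryProbe {
  def main(args: Array[String]): Unit = {
    // Fixed executor sizing for this run; these values are just what I am
    // currently experimenting with, not a recommendation.
    val spark = SparkSession.builder()
      .appName("tpch-memory-probe")
      .config("spark.executor.memory", "8g")   // heap per executor
      .config("spark.memory.fraction", "0.6")  // unified execution + storage pool
      .getOrCreate()

    // TPC-H lineitem table in Parquet (placeholder path).
    val lineitem = spark.read.parquet("/data/tpch/lineitem.parquet")
    lineitem.createOrReplaceTempView("lineitem")

    // A representative query that exercises Filter, Aggregate and Exchange.
    val q = spark.sql(
      """SELECT l_returnflag, l_linestatus,
        |       sum(l_quantity)      AS sum_qty,
        |       avg(l_extendedprice) AS avg_price
        |FROM lineitem
        |WHERE l_shipdate <= '1998-09-02'
        |GROUP BY l_returnflag, l_linestatus""".stripMargin)

    q.explain()  // physical plan: which operators will run
    q.collect()  // trigger execution

    // After execution, per-operator numbers (peak memory, spill size, output
    // rows) show up under the SQL tab of the Spark UI for this query.
    spark.stop()
  }
}
```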
Hemant Bhanawat
https://www.linkedin.com/in/hemant-bhanawat-92a3811
www.snappydata.io