For hash-based aggregation, Spark SQL loads the entire partition and organizes it into in-memory HashMaps; this eats a lot of memory when there are few duplicated GROUP BY keys over a large number of records.
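As a rough illustration (a simplified stand-in, not Spark's actual implementation), a hash aggregation keeps one buffer per distinct group key, so with few duplicated keys the map grows to nearly one entry per input row:

import scala.collection.mutable

// Simplified stand-in for hash aggregation (illustrative only): one
// (count, sum) buffer per distinct key, so memory scales with the number
// of distinct GROUP BY keys rather than shrinking as rows are consumed.
object HashAggSketch {
  def main(args: Array[String]): Unit = {
    val rows = Seq(("a", 1L), ("b", 2L), ("a", 3L))            // (key, value) records
    val buffers = mutable.HashMap.empty[String, (Long, Long)]  // key -> (count, sum)
    for ((key, value) <- rows) {
      val (count, sum) = buffers.getOrElse(key, (0L, 0L))
      buffers(key) = (count + 1, sum + value)
    }
    println(buffers) // e.g. Map(b -> (1,2), a -> (2,4))
  }
}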
A couple of things you can try, case by case (a sketch applying all three appears at the end of this thread):
· Increase the number of partitions (the record count in each partition will drop).
· Use more memory for the executors (--executor-memory 120g).
· Reduce the Spark cores (to reduce the number of parallel running threads).
We are trying to improve this by using sort-merge aggregation, which should reduce memory utilization significantly, but that work is still ongoing.

Cheng Hao

From: Masf [mailto:masfwo...@gmail.com]
Sent: Thursday, April 2, 2015 11:47 PM
To: user@spark.apache.org
Subject: Spark SQL. Memory consumption

Hi.

I'm using Spark SQL 1.2. I have this query:

CREATE TABLE test_MA STORED AS PARQUET AS
SELECT field1
      ,field2
      ,field3
      ,field4
      ,field5
      ,COUNT(1) AS field6
      ,MAX(field7)
      ,MIN(field8)
      ,SUM(field9 / 100)
      ,COUNT(field10)
      ,SUM(IF(field11 < -500, 1, 0))
      ,MAX(field12)
      ,SUM(IF(field13 = 1, 1, 0))
      ,SUM(IF(field13 IN (3,4,5,6,10,104,105,107), 1, 0))
      ,SUM(IF(field13 = 2012, 1, 0))
      ,SUM(IF(field13 IN (0,100,101,102,103,106), 1, 0))
FROM table1 CL
JOIN table2 netw ON CL.field15 = netw.id
WHERE field3 IS NOT NULL
  AND field4 IS NOT NULL
  AND field5 IS NOT NULL
GROUP BY field1, field2, field3, field4, netw.field5

I submit it with:

spark-submit --master spark://master:7077 \
  --driver-memory 20g --executor-memory 60g \
  --class "GMain" \
  --driver-class-path '/opt/cloudera/parcels/CDH/lib/hive/lib/*' \
  --driver-java-options '-Dspark.executor.extraClassPath=/opt/cloudera/parcels/CDH/lib/hive/lib/*' \
  project_2.10-1.0.jar 2> ./error

The input data is 8 GB in Parquet format. It crashes many times with GC overhead errors. I've set spark.sql.shuffle.partitions to 1024, but my worker nodes (with 128 GB RAM per node) still collapse.

Is this query too difficult for Spark SQL? Would it be better to do it in plain Spark? Am I doing something wrong?

Thanks

--
Regards.
Miguel Ángel
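A minimal sketch of how the three knobs suggested above could be set for a job like this one (the class name GMainTuned and the memory, core, and partition values are illustrative assumptions, not tested settings):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Hypothetical variant of the GMain class above; all values are assumptions.
object GMainTuned {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("GMainTuned")
      .set("spark.executor.memory", "120g") // larger executors (same effect as --executor-memory 120g)
      .set("spark.cores.max", "16")         // cap total cores -> fewer parallel threads (standalone mode)
    val sc = new SparkContext(conf)
    val sqlContext = new HiveContext(sc)
    // More, smaller partitions -> a smaller in-memory HashMap per aggregation task.
    sqlContext.setConf("spark.sql.shuffle.partitions", "2048")
    sqlContext.sql("CREATE TABLE test_MA STORED AS PARQUET AS SELECT ...") // the full query above
  }
}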