Hi there, I was trying the new DataFrame API with some basic operations on a Parquet dataset. I have 7 nodes, each with 12 cores and 8 GB of RAM allocated to the worker, running in standalone cluster mode. The code is the following:
    val people = sqlContext.parquetFile("/data.parquet")
    val res = people.groupBy("name", "date").agg(sum("power"), sum("supply")).take(10)
    res.foreach(println)  // print the 10 aggregated rows; println(res) would only print the array reference

The dataset consists of 16 billion entries. The error I get is:

    java.lang.OutOfMemoryError: GC overhead limit exceeded

My configuration is:

    spark.serializer                 org.apache.spark.serializer.KryoSerializer
    spark.driver.memory              6g
    spark.executor.extraJavaOptions  -XX:+UseCompressedOops
    spark.shuffle.manager            sort

Any idea how I can work around this? Thanks a lot.
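For reference, here is a self-contained sketch of the whole job, assuming the Spark 1.3.x API (SQLContext and parquetFile); the object name and app name are placeholders I made up, not anything from my actual setup:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.functions.sum

    object PowerAgg {  // hypothetical name, for illustration only
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("PowerAgg")
          .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
          .set("spark.shuffle.manager", "sort")
        val sc = new SparkContext(conf)
        val sqlContext = new SQLContext(sc)

        // Load the Parquet dataset and aggregate power/supply per (name, date)
        val people = sqlContext.parquetFile("/data.parquet")
        val res = people
          .groupBy("name", "date")
          .agg(sum("power"), sum("supply"))
          .take(10)
        res.foreach(println)

        sc.stop()
      }
    }

The spark.driver.memory and spark.executor.extraJavaOptions settings stay in spark-defaults.conf (or on the spark-submit command line) rather than in SparkConf, since, as far as I understand, the driver JVM is already running by the time SparkConf is read.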