Hi.

I'm using Spark SQL 1.2. I have this query:

CREATE TABLE test_MA STORED AS PARQUET AS
SELECT
 field1
,field2
,field3
,field4
,field5
,COUNT(1) AS field6
,MAX(field7)
,MIN(field8)
,SUM(field9 / 100)
,COUNT(field10)
,SUM(IF(field11 < -500, 1, 0))
,MAX(field12)
,SUM(IF(field13 = 1, 1, 0))
,SUM(IF(field13 IN (3,4,5,6,10,104,105,107), 1, 0))
,SUM(IF(field13 = 2012, 1, 0))
,SUM(IF(field13 IN (0,100,101,102,103,106), 1, 0))
FROM table1 CL
JOIN table2 netw
  ON CL.field15 = netw.id
WHERE field3 IS NOT NULL
  AND field4 IS NOT NULL
  AND field5 IS NOT NULL
GROUP BY field1, field2, field3, field4, netw.field5
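
For context, the statement is run from a Scala driver through a HiveContext, along these lines (a simplified sketch; the string passed to sql() is the full CTAS statement shown above):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object GMain {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("GMain"))
    val hiveContext = new HiveContext(sc)

    // `query` holds the full CTAS statement shown above.
    val query = "CREATE TABLE test_MA STORED AS PARQUET AS SELECT ..."
    hiveContext.sql(query)
  }
}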


I submit it with:

spark-submit --master spark://master:7077 \
  --driver-memory 20g \
  --executor-memory 60g \
  --class "GMain" \
  --driver-class-path '/opt/cloudera/parcels/CDH/lib/hive/lib/*' \
  --driver-java-options '-Dspark.executor.extraClassPath=/opt/cloudera/parcels/CDH/lib/hive/lib/*' \
  project_2.10-1.0.jar \
  2> ./error


The input data is 8 GB in Parquet format. The job frequently crashes with GC overhead limit exceeded errors. I've set spark.sql.shuffle.partitions to 1024, but my worker nodes (128 GB of RAM each) still collapse.
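
For reference, I apply the setting on the HiveContext before running the query (a minimal sketch; in Spark SQL the property is spark.sql.shuffle.partitions, not spark.shuffle.partitions):

// Set the Spark SQL shuffle parallelism before running the CTAS.
hiveContext.setConf("spark.sql.shuffle.partitions", "1024")
// Equivalently, through a SQL command:
hiveContext.sql("SET spark.sql.shuffle.partitions=1024")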

Is this query too demanding for Spark SQL?
Would it be better to implement it with the core Spark (RDD) API?
Am I doing something wrong?
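
To make the second question concrete, this is the kind of rewrite I have in mind (a hypothetical sketch: `rows` stands for the already-joined and filtered records, the field names are illustrative, and only a few of the aggregates are shown):

import org.apache.spark.SparkContext._   // pair-RDD implicits, needed in Spark 1.2
import org.apache.spark.rdd.RDD

// Illustrative record type for the joined, filtered input rows.
case class Rec(f1: String, f2: String, f3: String, f4: String, f5: String,
               f7: Long, f9: Double, f13: Int)

def aggregateRows(rows: RDD[Rec]) =
  rows
    .map { r =>
      ((r.f1, r.f2, r.f3, r.f4, r.f5),   // GROUP BY key
       (1L,                              // feeds COUNT(1)
        r.f7,                            // feeds MAX(field7)
        r.f9 / 100,                      // feeds SUM(field9 / 100)
        if (r.f13 == 1) 1L else 0L))     // feeds SUM(IF(field13 = 1, 1, 0))
    }
    .reduceByKey { (a, b) =>
      (a._1 + b._1,                      // counts add up
       math.max(a._2, b._2),             // max of partial maxes
       a._3 + b._3,                      // partial sums add up
       a._4 + b._4)                      // conditional counts add up
    }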


Thanks
-- 


Regards.
Miguel Ángel
