I have many tables with the same schema, partitioned by time, so a given id
can appear in several of them. I would like to compute an aggregate for such
ids across all the tables. The tables are originally stored as files on HDFS.
Once a table is loaded as a SchemaRDD, I call cacheTable on it. Each table is
around 30m - 100m of serialized data.

The SQL I composed looks like the following:

select id, sum(cost) as cost from (
  (
    (select id, sum(cost) as cost from table_1
     where id = 11111 group by id)
    union all
    (select id, sum(cost) as cost from table_2
     where id = 11111 group by id)
  )
  union all
  (select id, sum(cost) as cost from table_3
   where id = 11111 group by id)
) as temp_table
group by id
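
Since there are many such tables, a query of this shape can be generated rather than written by hand. The following is only a sketch in plain Python: build_union_query is a hypothetical helper, the table names are the placeholders used above, and it emits a flat chain of "union all" rather than nested parentheses, on the assumption that the parser accepts that form.

```python
def build_union_query(tables, target_id):
    """Build one per-table aggregate per table, chain them with
    'union all', and re-aggregate the combined rows by id."""
    if not tables:
        raise ValueError("at least one table name is required")

    # One pre-aggregated subquery per partition table.
    subqueries = [
        "(select id, sum(cost) as cost from {t} "
        "where id = {i} group by id)".format(t=t, i=int(target_id))
        for t in tables
    ]

    # Flat chain of subqueries (assumption: nesting is not required).
    inner = "\n  union all\n  ".join(subqueries)

    return (
        "select id, sum(cost) as cost from (\n  "
        + inner
        + "\n) as temp_table\ngroup by id"
    )


query = build_union_query(["table_1", "table_2", "table_3"], 11111)
print(query)
```

The resulting string would then be passed to sparkSqlContext.sql() in the same way as the hand-written query.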


The call to sparkSqlContext.sql() takes a long time to return a SchemaRDD,
whereas calling collect() on that RDD is not too slow.
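
To confirm where the time goes, the two steps can be timed separately. A minimal sketch, assuming a Python driver; timed is a hypothetical helper, and the commented usage assumes a SQL context object named sql_context and a query string named query, neither of which is defined here.

```python
import time


def timed(label, fn):
    """Run fn(), print how long it took, and return (result, seconds)."""
    start = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - start
    print("{}: {:.3f}s".format(label, elapsed))
    return result, elapsed


# Hypothetical usage (sql_context and query are assumed to exist):
#   rdd, t_plan = timed("sql()", lambda: sql_context.sql(query))
#   rows, t_run = timed("collect()", lambda: rdd.collect())
```

If the first number dominates, the time is being spent before any job runs, which points at query parsing and planning rather than at execution.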

Is there something I am doing wrong here? Any tips on how to debug this?




