Hi,

You can merge them into one table like this:
sqlContext.table("table_1")
  .unionAll(sqlContext.table("table_2"))
  .unionAll(sqlContext.table("table_3"))
  .registerTempTable("table_all")

Or load them in one call:

sqlContext.parquetFile("table_1.parquet,table_2.parquet,table_3.parquet")
  .registerTempTable("table_all")

On Wed, Oct 15, 2014 at 2:51 AM, shuluster <s...@turn.com> wrote:
> I have many tables with the same schema, partitioned by time. For example,
> one id could appear in many of those tables, and I would like to compute an
> aggregation over such ids. These tables originate as files on HDFS. Once a
> table's SchemaRDD is loaded, I call cacheTable on it. Each table holds
> around 30 MB - 100 MB of serialized data.
>
> The SQL I composed looks like the following:
>
> select id, sum(cost) as cost from (
>   (select id, sum(cost) as cost from table_1
>    where id = 11111 group by id)
>   union all
>   (select id, sum(cost) as cost from table_2
>    where id = 11111 group by id)
>   union all
>   (select id, sum(cost) as cost from table_3
>    where id = 11111 group by id)
> ) as temp_table
> group by id
>
> The call to sparkSqlContext.sql() takes a long time to return a SchemaRDD,
> while executing collect() on that RDD is not too slow.
>
> Is there something I am doing wrong here? Or any tips on how to debug?
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-union-all-is-slow-tp16407.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
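Either way, once the merged table is registered (as "table_all" in the snippets above), the nested union-of-subqueries collapses to a single aggregation. A sketch of the simplified query, assuming the same id filter as in your example:

```sql
-- One scan over the merged table replaces the three per-table
-- subqueries plus the outer re-aggregation.
select id, sum(cost) as cost
from table_all
where id = 11111
group by id
```

This also gives the planner one flat plan to analyze instead of a deeply nested union, which may help with the slow sql() call you are seeing.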