Hi,

You can merge them into one table by:

sqlContext.table("table_1")
  .unionAll(sqlContext.table("table_2"))
  .unionAll(sqlContext.table("table_3"))
  .registerTempTable("table_all")

(Note: unionAll is a method on SchemaRDD, not on SQLContext, and the
method is registerTempTable.)

Or load them in one call by:

sqlContext.parquetFile("table_1.parquet,table_2.parquet,table_3.parquet").registerTempTable("table_all")
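Either way, the single union followed by one GROUP BY returns the same totals as the nested per-table aggregation in your query, since summing per-table sums equals summing over the union. A minimal plain-Python sketch (no Spark; the tables and costs are made-up toy data) of that equivalence:

```python
# Toy stand-ins for table_1..table_3 as (id, cost) rows.
from collections import defaultdict

table_1 = [(11111, 10.0), (11111, 5.0), (22222, 3.0)]
table_2 = [(11111, 7.0)]
table_3 = [(11111, 2.5), (33333, 1.0)]

def aggregate(rows, target_id=11111):
    """SELECT id, SUM(cost) ... WHERE id = target_id GROUP BY id."""
    totals = defaultdict(float)
    for row_id, cost in rows:
        if row_id == target_id:
            totals[row_id] += cost
    return dict(totals)

# Approach 1: union all rows first, then aggregate once.
union_then_agg = aggregate(table_1 + table_2 + table_3)

# Approach 2: aggregate each table, then sum the partial sums
# (the nested sub-select structure from the original query).
agg_then_union = defaultdict(float)
for partial in (aggregate(t) for t in (table_1, table_2, table_3)):
    for row_id, cost in partial.items():
        agg_then_union[row_id] += cost

assert union_then_agg == dict(agg_then_union)
print(union_then_agg)  # {11111: 24.5}
```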

On Wed, Oct 15, 2014 at 2:51 AM, shuluster <s...@turn.com> wrote:

> I have many tables with the same schema, partitioned by time. For
> example, one id could appear in many of those tables. I would like to
> compute aggregations over such ids. These tables originally live on HDFS
> as files. Once a table's SchemaRDD is loaded, I call cacheTable on it.
> Each table is around 30 MB - 100 MB of serialized data.
>
> The SQL I composed looks like the following:
>
> select id, sum(cost) as cost from (
>   (select id, sum(cost) as cost from table_1
>    where id = 11111 group by id)
>   union all
>   (select id, sum(cost) as cost from table_2
>    where id = 11111 group by id)
>   union all
>   (select id, sum(cost) as cost from table_3
>    where id = 11111 group by id)
> ) as temp_table
> group by id
>
>
> The call to sparkSqlContext.sql() takes a long time to return a
> SchemaRDD, while executing collect() on that RDD is not too slow.
>
> Is there something I am doing wrong here? Any tips on how to debug?
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-union-all-is-slow-tp16407.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
