Re: Spark SQL performance: version 1.6 vs version 1.5

2016-02-12 Thread Le Tien Dung
Hi Herman,

Thank you very much for your mail. We were indeed able to revert to the old
behaviour of Spark SQL; the performance and the DAG are now the same in both
versions.

Many thanks and have a nice weekend,
Tien-Dung

PS: In order to revert, the setting value should actually be "true", not "false".
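For completeness, the revert can be applied from a spark-shell session like this (a sketch, assuming the default `sqlContext` provided by the 1.6.0 shell):

```scala
// Revert to the 1.5-style plan for single distinct aggregates.
// Per this thread, the value that restores the old behaviour is "true".
sqlContext.setConf("spark.sql.specializeSingleDistinctAggPlanning", "true")
```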

On Fri, Feb 12, 2016 at 4:51 PM, Herman van Hövell tot Westerflier <
hvanhov...@questtec.nl> wrote:

> Hi Tien-Dung,
>
> 1.6 plans single distinct aggregates like multiple distinct aggregates;
> this inherently causes some overhead but is more stable in cases of high
> cardinality. You can revert to the old behavior by setting the
> spark.sql.specializeSingleDistinctAggPlanning option to false. See also:
> https://github.com/apache/spark/blob/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala#L452-L462
>
> HTH
>
> Kind regards,
>
> Herman van Hövell
>
>
> 2016-02-12 16:23 GMT+01:00 Le Tien Dung :
>
>> [original message trimmed; it appears in full below]
>


Spark SQL performance: version 1.6 vs version 1.5

2016-02-12 Thread Le Tien Dung
Hi folks,

I have compared the performance of Spark SQL versions 1.6.0 and 1.5.2. In a
simple case, Spark 1.6.0 is noticeably faster than Spark 1.5.2. However, for
a more complex query (in our case, an aggregation query with grouping sets),
Spark SQL 1.6.0 is much slower than Spark SQL 1.5.2. Could any of you kindly
suggest a workaround for this performance regression?

Here is our test scenario:

case class Toto(
  a: String = f"${(math.random*1e6).toLong}%06.0f",
  b: String = f"${(math.random*1e6).toLong}%06.0f",
  c: String = f"${(math.random*1e6).toLong}%06.0f",
  n: Int = (math.random*1e3).toInt,
  m: Double = math.random*1e3)

val data = sc.parallelize(1 to 1e6.toInt).map(i => Toto())
val df: org.apache.spark.sql.DataFrame = sqlContext.createDataFrame(data)

df.registerTempTable("toto")
val sqlSelect  = "SELECT a, b, COUNT(1) AS k1, COUNT(DISTINCT n) AS k2, SUM(m) AS k3"
val sqlGroupBy = "FROM toto GROUP BY a, b GROUPING SETS ((a,b),(a),(b))"
val sqlText    = s"$sqlSelect $sqlGroupBy"

val rs1 = sqlContext.sql(sqlText)
rs1.saveAsParquetFile("rs1")
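As a side note, the default arguments in `Toto` build zero-padded six-digit random keys, giving roughly 1e6 distinct values per column. A standalone sketch of that pattern (plain Scala, no Spark required; `randomKey` is an illustrative name, not from the original code, and `Random.nextDouble()` stands in for `math.random`):

```scala
// Mirrors the Toto defaults: a random value in [0, 1e6), truncated to a Long
// and rendered by the f interpolator as a zero-padded six-digit string.
// (The Long is widened to Double by the interpolator, so %06.0f applies.)
def randomKey(): String = f"${(scala.util.Random.nextDouble() * 1e6).toLong}%06.0f"

val k = randomKey()
// k is always exactly six characters, all digits, e.g. "042317"
```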

The query is executed from a spark-shell in local mode with
--driver-memory=1G. Screenshots from Spark UI are accessible at
http://i.stack.imgur.com/VujQY.png (Spark 1.5.2) and
http://i.stack.imgur.com/Hlg95.png (Spark 1.6.0). The DAG on Spark 1.6.0
can be viewed at http://i.stack.imgur.com/u3HrG.png.
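One way to pin down which plan each configuration produces is to compare the physical plans before and after flipping the option Herman mentions (a sketch for a 1.6.0 spark-shell; `explain(true)` prints the parsed, analyzed, optimized, and physical plans):

```scala
// Physical plan under the 1.6.0 default planning for single distinct aggregates.
sqlContext.sql(sqlText).explain(true)

// Flip the option discussed in this thread, then re-plan the same query.
sqlContext.setConf("spark.sql.specializeSingleDistinctAggPlanning", "true")
sqlContext.sql(sqlText).explain(true)
```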

Many thanks and looking forward to hearing from you,
Tien-Dung Le