Re: SPARK_WORKER_MEMORY in Spark Standalone - conf.getenv vs System.getenv?

2016-02-12 Thread Sean Owen
Yes, you said it can only be set in a props file, but why do you say that?
I ask because the resolution of your first question is that these are not
handled differently.

On Fri, Feb 12, 2016 at 11:11 PM, Jacek Laskowski  wrote:
> On Fri, Feb 12, 2016 at 11:08 PM, Sean Owen  wrote:
>> I think that difference in the code is just an oversight. They
>> actually do the same thing.
>
> Correct. I just wanted to know the reason, if there was one.
>
>> Why do you say this property can only be set in a file?
>
> I said that conf/spark-defaults.conf can *not* be used to set the
> spark.worker.ui.port property and wondered why that is so. It'd be nice
> to be able to use it for such settings (rather than resorting to
> workarounds like SPARK_WORKER_OPTS=-Dspark.worker.ui.port=21212). I just
> spotted it and thought I'd ask whether it needs to be cleaned up or
> improved.
>
> Jacek

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: SPARK_WORKER_MEMORY in Spark Standalone - conf.getenv vs System.getenv?

2016-02-12 Thread Jacek Laskowski
On Fri, Feb 12, 2016 at 11:08 PM, Sean Owen  wrote:
> I think that difference in the code is just an oversight. They
> actually do the same thing.

Correct. I just wanted to know the reason, if there was one.

> Why do you say this property can only be set in a file?

I said that conf/spark-defaults.conf can *not* be used to set the
spark.worker.ui.port property and wondered why that is so. It'd be nice
to be able to use it for such settings (rather than resorting to
workarounds like SPARK_WORKER_OPTS=-Dspark.worker.ui.port=21212). I just
spotted it and thought I'd ask whether it needs to be cleaned up or
improved.
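
For what it's worth, the workaround seems to work because the -D flag
becomes a JVM system property and a SparkConf built with loadDefaults =
true copies every "spark.*" system property, whereas (as far as I can
tell) conf/spark-defaults.conf is only read by spark-submit, not by the
worker launch path. A tiny hypothetical check, not part of Spark:

import org.apache.spark.SparkConf

// Run with -Dspark.worker.ui.port=21212 (what SPARK_WORKER_OPTS ends up
// passing) and the key is visible below; put the same entry only in
// conf/spark-defaults.conf and it is not, since SparkConf itself never
// reads that file.
object WorkerUiPortCheck {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf(loadDefaults = true)
    println(conf.getOption("spark.worker.ui.port").getOrElse("<not set>"))
  }
}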

Jacek

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: SPARK_WORKER_MEMORY in Spark Standalone - conf.getenv vs System.getenv?

2016-02-12 Thread Sean Owen
I think that difference in the code is just an oversight. They
actually do the same thing.
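
For reference, a rough sketch (paraphrased, not the exact WorkerArguments
source) of why the two forms are equivalent at runtime: SparkConf.getenv
is just a thin wrapper over System.getenv that tests can stub out.

// Sketch only - a stand-in for SparkConf.getenv, which simply delegates
// to System.getenv, so both lookups below read the same environment.
object GetenvSketch {
  def confStyleGetenv(name: String): String = System.getenv(name)

  def main(args: Array[String]): Unit = {
    val memory = Option(confStyleGetenv("SPARK_WORKER_MEMORY"))
    val port   = Option(System.getenv("SPARK_WORKER_PORT"))
    println(s"SPARK_WORKER_MEMORY (conf-style lookup): $memory")
    println(s"SPARK_WORKER_PORT   (System.getenv):     $port")
  }
}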

Why do you say this property can only be set in a file?

On Fri, Feb 12, 2016 at 9:39 PM, Jacek Laskowski  wrote:
> Hi devs,
>
> Following up on this, it appears that spark.worker.ui.port can only be
> set via --properties-file. I wonder why conf/spark-defaults.conf is
> *not* used for the spark.worker.ui.port property. Is there a reason for
> the decision?
>
> Pozdrawiam,
> Jacek
>
> Jacek Laskowski | https://medium.com/@jaceklaskowski/
> Mastering Apache Spark
> ==> https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Thu, Feb 11, 2016 at 2:51 PM, Jacek Laskowski  wrote:
>> Hi,
>>
>> Is there a reason to use conf to read SPARK_WORKER_MEMORY instead of
>> System.getenv, as is done for the other env vars?
>>
>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/worker/WorkerArguments.scala#L45
>>
>> Pozdrawiam,
>> Jacek
>>
>> Jacek Laskowski | https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark
>> ==> https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
>> Follow me at https://twitter.com/jaceklaskowski
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: SPARK_WORKER_MEMORY in Spark Standalone - conf.getenv vs System.getenv?

2016-02-12 Thread Jacek Laskowski
Hi devs,

Following up on this, it appears that spark.worker.ui.port can only be
set via --properties-file. I wonder why conf/spark-defaults.conf is
*not* used for the spark.worker.ui.port property. Is there a reason for
the decision?

Pozdrawiam,
Jacek

Jacek Laskowski | https://medium.com/@jaceklaskowski/
Mastering Apache Spark
==> https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
Follow me at https://twitter.com/jaceklaskowski


On Thu, Feb 11, 2016 at 2:51 PM, Jacek Laskowski  wrote:
> Hi,
>
> Is there a reason to use conf to read SPARK_WORKER_MEMORY instead of
> System.getenv, as is done for the other env vars?
>
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/worker/WorkerArguments.scala#L45
>
> Pozdrawiam,
> Jacek
>
> Jacek Laskowski | https://medium.com/@jaceklaskowski/
> Mastering Apache Spark
> ==> https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
> Follow me at https://twitter.com/jaceklaskowski

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Saving a Pipeline with DecisionTreeModel Spark ML

2016-02-12 Thread Rakesh Chalasani
There is already JIRA tracking this
https://issues.apache.org/jira/browse/SPARK-11888

On Fri, Feb 12, 2016 at 2:34 PM gstvolvr  wrote:

> Hi all,
>
> I noticed that I cannot save a Pipeline containing a DecisionTree model
> the same way I can save one with a LogisticRegression model.
> It looks like DecisionTreeClassificationModel does not implement
> MLWritable.
>
> I describe a use case in this post:
> http://stackoverflow.com/questions/35368414/saving-a-pipeline-with-decisiontreemodel-spark-ml
>
> Is there another way of doing this or should I open a JIRA?
>
> Thanks,
> Gustavo
>
>
>
>
>


Saving a Pipeline with DecisionTreeModel Spark ML

2016-02-12 Thread gstvolvr
Hi all,

I noticed that I cannot save a Pipeline containing a DecisionTree model
the same way I can save one with a LogisticRegression model.
It looks like DecisionTreeClassificationModel does not implement MLWritable.
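
A minimal sketch of what I mean (not my actual code; it assumes a
spark-shell session with a DataFrame named training that has a binary
label column "label" and a "features" column):

import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.classification.{DecisionTreeClassifier, LogisticRegression}
import org.apache.spark.ml.feature.StringIndexer

// Index the label so tree-based learners get class-count metadata.
val indexer = new StringIndexer().setInputCol("label").setOutputCol("indexedLabel")

// Saving a fitted pipeline ending in LogisticRegression works, because
// LogisticRegressionModel implements MLWritable in 1.6.
val lr = new LogisticRegression().setLabelCol("indexedLabel")
val lrModel = new Pipeline().setStages(Array[PipelineStage](indexer, lr)).fit(training)
lrModel.save("/tmp/lr-pipeline")

// The same call fails when the final stage is a DecisionTreeClassifier,
// since DecisionTreeClassificationModel does not implement MLWritable.
val dt = new DecisionTreeClassifier().setLabelCol("indexedLabel")
val dtModel = new Pipeline().setStages(Array[PipelineStage](indexer, dt)).fit(training)
dtModel.save("/tmp/dt-pipeline")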

I describe a use case in this post:
http://stackoverflow.com/questions/35368414/saving-a-pipeline-with-decisiontreemodel-spark-ml

Is there another way of doing this or should I open a JIRA? 

Thanks,
Gustavo






-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Spark SQL performance: version 1.6 vs version 1.5

2016-02-12 Thread Le Tien Dung
Hi Herman,

Thank you very much for your mail. Indeed, we can revert to the old
behaviour of Spark SQL (the performance and the DAG are now the same in
both versions).

Many thanks and have a nice weekend,
Tien-Dung

PS: In order to revert, the setting value should be "true".

On Fri, Feb 12, 2016 at 4:51 PM, Herman van Hövell tot Westerflier <
hvanhov...@questtec.nl> wrote:

> Hi Tien-Dung,
>
> 1.6 plans single distinct aggregates like multiple distinct aggregates;
> this inherently causes some overhead but is more stable in case of high
> cardinalities. You can revert to the old behavior by setting the
> spark.sql.specializeSingleDistinctAggPlanning option to false. See also:
> https://github.com/apache/spark/blob/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala#L452-L462
>
> HTH
>
> Kind regards,
>
> Herman van Hövell
>
>
> 2016-02-12 16:23 GMT+01:00 Le Tien Dung :
>
>> Hi folks,
>>
>> I have compared the performance of Spark SQL version 1.6.0 and version
>> 1.5.2. In a simple case, Spark 1.6.0 is quite a bit faster than Spark
>> 1.5.2. However, in a more complex query - in our case an aggregation
>> query with grouping sets - Spark SQL version 1.6.0 is much slower than
>> Spark SQL version 1.5. Could any of you kindly suggest a workaround for
>> this performance regression?
>>
>> Here is our test scenario:
>>
>> case class Toto(
>>  a: String = f"${(math.random*1e6).toLong}%06.0f",
>>  b: String = f"${(math.random*1e6).toLong}%06.0f",
>>  c: String = f"${(math.random*1e6).toLong}%06.0f",
>>  n: Int = (math.random*1e3).toInt,
>>  m: Double = (math.random*1e3))
>>
>> val data = sc.parallelize(1 to 1e6.toInt).map(i => Toto())
>> val df: org.apache.spark.sql.DataFrame = sqlContext.createDataFrame( data
>> )
>>
>> df.registerTempTable( "toto" )
>> val sqlSelect = "SELECT a, b, COUNT(1) AS k1, COUNT(DISTINCT n) AS k2, SUM(m) AS k3"
>> val sqlGroupBy = "FROM toto GROUP BY a, b GROUPING SETS ((a,b),(a),(b))"
>> val sqlText = s"$sqlSelect $sqlGroupBy"
>>
>> val rs1 = sqlContext.sql( sqlText )
>> rs1.saveAsParquetFile( "rs1" )
>>
>> The query is executed from a spark-shell in local mode with
>> --driver-memory=1G. Screenshots from Spark UI are accessible at
>> http://i.stack.imgur.com/VujQY.png (Spark 1.5.2) and
>> http://i.stack.imgur.com/Hlg95.png (Spark 1.6.0). The DAG on Spark 1.6.0
>> can be viewed at http://i.stack.imgur.com/u3HrG.png.
>>
>> Many thanks and looking forward to hearing from you,
>> Tien-Dung Le
>>
>>
>


Re: Spark SQL performance: version 1.6 vs version 1.5

2016-02-12 Thread Herman van Hövell tot Westerflier
Hi Tien-Dung,

1.6 plans single distinct aggregates like multiple distinct aggregates;
this inherently causes some overhead but is more stable in case of high
cardinalities. You can revert to the old behavior by setting the
spark.sql.specializeSingleDistinctAggPlanning option to false. See also:
https://github.com/apache/spark/blob/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala#L452-L462
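
For example, from a spark-shell session (the flag can also be passed with
--conf when launching):

// Flip the internal 1.6.x planner flag for the current SQLContext, then
// re-run the aggregation to compare plans and runtimes.
sqlContext.setConf("spark.sql.specializeSingleDistinctAggPlanning", "false")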

HTH

Kind regards,

Herman van Hövell


2016-02-12 16:23 GMT+01:00 Le Tien Dung :

> Hi folks,
>
> I have compared the performance of Spark SQL version 1.6.0 and version
> 1.5.2. In a simple case, Spark 1.6.0 is quite a bit faster than Spark
> 1.5.2. However, in a more complex query - in our case an aggregation
> query with grouping sets - Spark SQL version 1.6.0 is much slower than
> Spark SQL version 1.5. Could any of you kindly suggest a workaround for
> this performance regression?
>
> Here is our test scenario:
>
> case class Toto(
>  a: String = f"${(math.random*1e6).toLong}%06.0f",
>  b: String = f"${(math.random*1e6).toLong}%06.0f",
>  c: String = f"${(math.random*1e6).toLong}%06.0f",
>  n: Int = (math.random*1e3).toInt,
>  m: Double = (math.random*1e3))
>
> val data = sc.parallelize(1 to 1e6.toInt).map(i => Toto())
> val df: org.apache.spark.sql.DataFrame = sqlContext.createDataFrame( data )
>
> df.registerTempTable( "toto" )
> val sqlSelect = "SELECT a, b, COUNT(1) AS k1, COUNT(DISTINCT n) AS k2, SUM(m) AS k3"
> val sqlGroupBy = "FROM toto GROUP BY a, b GROUPING SETS ((a,b),(a),(b))"
> val sqlText = s"$sqlSelect $sqlGroupBy"
>
> val rs1 = sqlContext.sql( sqlText )
> rs1.saveAsParquetFile( "rs1" )
>
> The query is executed from a spark-shell in local mode with
> --driver-memory=1G. Screenshots from Spark UI are accessible at
> http://i.stack.imgur.com/VujQY.png (Spark 1.5.2) and
> http://i.stack.imgur.com/Hlg95.png (Spark 1.6.0). The DAG on Spark 1.6.0
> can be viewed at http://i.stack.imgur.com/u3HrG.png.
>
> Many thanks and looking forward to hearing from you,
> Tien-Dung Le
>
>


Spark SQL performance: version 1.6 vs version 1.5

2016-02-12 Thread Le Tien Dung
Hi folks,

I have compared the performance of Spark SQL version 1.6.0 and version
1.5.2. In a simple case, Spark 1.6.0 is quite a bit faster than Spark
1.5.2. However, in a more complex query - in our case an aggregation
query with grouping sets - Spark SQL version 1.6.0 is much slower than
Spark SQL version 1.5. Could any of you kindly suggest a workaround for
this performance regression?

Here is our test scenario:

case class Toto(
 a: String = f"${(math.random*1e6).toLong}%06.0f",
 b: String = f"${(math.random*1e6).toLong}%06.0f",
 c: String = f"${(math.random*1e6).toLong}%06.0f",
 n: Int = (math.random*1e3).toInt,
 m: Double = (math.random*1e3))

val data = sc.parallelize(1 to 1e6.toInt).map(i => Toto())
val df: org.apache.spark.sql.DataFrame = sqlContext.createDataFrame( data )

df.registerTempTable( "toto" )
val sqlSelect = "SELECT a, b, COUNT(1) AS k1, COUNT(DISTINCT n) AS k2, SUM(m) AS k3"
val sqlGroupBy = "FROM toto GROUP BY a, b GROUPING SETS ((a,b),(a),(b))"
val sqlText = s"$sqlSelect $sqlGroupBy"

val rs1 = sqlContext.sql( sqlText )
rs1.saveAsParquetFile( "rs1" )

The query is executed from a spark-shell in local mode with
--driver-memory=1G. Screenshots from Spark UI are accessible at
http://i.stack.imgur.com/VujQY.png (Spark 1.5.2) and
http://i.stack.imgur.com/Hlg95.png (Spark 1.6.0). The DAG on Spark 1.6.0
can be viewed at http://i.stack.imgur.com/u3HrG.png.
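
If the physical plans are useful for the comparison, they can also be
dumped directly from the same spark-shell session:

// Print the extended plan (parsed/analyzed/optimized/physical) for the
// grouping-sets query, so the 1.5.2 and 1.6.0 plans can be diffed.
rs1.explain(true)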

Many thanks and looking forward to hearing from you,
Tien-Dung Le