[jira] [Commented] (SPARK-16183) Large Spark SQL commands cause StackOverflowError in parser when using sqlContext.sql

2016-06-24 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15347874#comment-15347874
 ] 

Dongjoon Hyun commented on SPARK-16183:
---

Hi, [~UZiVcbfPXaNrMtT].
Could you provide a sample query to reproduce your situation?
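
For illustration, a hypothetical generator (not the reporter's tool) of the kind of query described below: many inputs combined into one very long statement handed to a single sql() call.
{code}
// Hypothetical sketch only: build one very long SQL string over many registered inputs,
// the shape of statement the 1.6 combinator-based parser can fail on with StackOverflowError.
val n = 250
(1 to n).foreach { i =>
  sqlContext.range(0, 10).selectExpr("id AS key", "id AS value").registerTempTable(s"input_$i")
}
val bigSql = (1 to n).map(i => s"SELECT key, value FROM input_$i").mkString("\nUNION ALL\n")
println(bigSql.length)   // grows with the number of inputs
sqlContext.sql(bigSql)   // very long statements like this can hit the parser's recursion limit
{code}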

> Large Spark SQL commands cause StackOverflowError in parser when using 
> sqlContext.sql
> -
>
> Key: SPARK-16183
> URL: https://issues.apache.org/jira/browse/SPARK-16183
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.6.1
> Environment: Running on AWS EMR
>Reporter: Matthew Porter
>
> Hi,
> I have created a PySpark SQL-based tool which auto-generates a complex SQL 
> command to be run via sqlContext.sql(cmd) based on a large number of 
> parameters. As the number of input files to be filtered and joined in this 
> query grows, so does the length of the SQL query. The tool runs fine up until 
> about 200+ files are included in the join, at which point the SQL command 
> becomes very long (~100K characters). It is only on these longer queries that 
> Spark fails, throwing an exception due to what seems to be too much recursion 
> occurring within the SparkSQL parser:
> {code}
> Traceback (most recent call last):
> ...
> merged_df = sqlsc.sql(cmd)
>   File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/context.py", line 
> 580, in sql
>   File "/usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", 
> line 813, in __call__
>   File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 45, 
> in deco
>   File "/usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 
> 308, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o173.sql.
> : java.lang.StackOverflowError
>   at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
>   at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   ... (the same combinator frames repeat until the stack is exhausted; the trace is truncated in the original message)
> {code}

[jira] [Created] (SPARK-16185) Unresolved Operator When Creating Table As Select Without Enabling Hive Support

2016-06-24 Thread Xiao Li (JIRA)
Xiao Li created SPARK-16185:
---

 Summary: Unresolved Operator When Creating Table As Select Without 
Enabling Hive Support
 Key: SPARK-16185
 URL: https://issues.apache.org/jira/browse/SPARK-16185
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Xiao Li


When Hive support is not enabled, the following query generates a confusing 
error message:

{noformat}
  sql("CREATE TABLE t2 USING parquet SELECT a, b from t1")
{noformat}

{noformat}
unresolved operator 'CreateHiveTableAsSelectLogicalPlan CatalogTable(
Table: `t`
Created: Fri Jun 24 00:09:15 PDT 2016
Last Access: Wed Dec 31 15:59:59 PST 1969
Type: MANAGED
Storage(InputFormat: org.apache.hadoop.mapred.TextInputFormat, 
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat)), 
false;
org.apache.spark.sql.AnalysisException: unresolved operator 
'CreateHiveTableAsSelectLogicalPlan CatalogTable(
Table: `t`
Created: Fri Jun 24 00:09:15 PDT 2016
Last Access: Wed Dec 31 15:59:59 PST 1969
Type: MANAGED
Storage(InputFormat: org.apache.hadoop.mapred.TextInputFormat, 
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat)), 
false;
{noformat}
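
For reference, a minimal spark-shell sketch of the setup assumed here (a SparkSession built without enableHiveSupport(); the temp view is illustrative):
{code}
import org.apache.spark.sql.SparkSession

// Assumed setup, not from the ticket: a session *without* Hive support.
val spark = SparkSession.builder().appName("ctas-without-hive").master("local[*]").getOrCreate()
import spark.implicits._

Seq((1, "x"), (2, "y")).toDF("a", "b").createOrReplaceTempView("t1")

// Per the report above, this produces the confusing AnalysisException instead of a
// clear "Hive support is required"-style message.
spark.sql("CREATE TABLE t2 USING parquet SELECT a, b from t1")
{code}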






[jira] [Assigned] (SPARK-16185) Unresolved Operator When Creating Table As Select Without Enabling Hive Support

2016-06-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16185:


Assignee: (was: Apache Spark)

> Unresolved Operator When Creating Table As Select Without Enabling Hive 
> Support
> ---
>
> Key: SPARK-16185
> URL: https://issues.apache.org/jira/browse/SPARK-16185
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> When Hive support is not enabled, the following query generates a confusing 
> error message:
> {noformat}
>   sql("CREATE TABLE t2 USING parquet SELECT a, b from t1")
> {noformat}
> {noformat}
> unresolved operator 'CreateHiveTableAsSelectLogicalPlan CatalogTable(
>   Table: `t`
>   Created: Fri Jun 24 00:09:15 PDT 2016
>   Last Access: Wed Dec 31 15:59:59 PST 1969
>   Type: MANAGED
>   Storage(InputFormat: org.apache.hadoop.mapred.TextInputFormat, 
> OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat)), 
> false;
> org.apache.spark.sql.AnalysisException: unresolved operator 
> 'CreateHiveTableAsSelectLogicalPlan CatalogTable(
>   Table: `t`
>   Created: Fri Jun 24 00:09:15 PDT 2016
>   Last Access: Wed Dec 31 15:59:59 PST 1969
>   Type: MANAGED
>   Storage(InputFormat: org.apache.hadoop.mapred.TextInputFormat, 
> OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat)), 
> false;
> {noformat}






[jira] [Assigned] (SPARK-16185) Unresolved Operator When Creating Table As Select Without Enabling Hive Support

2016-06-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16185:


Assignee: Apache Spark

> Unresolved Operator When Creating Table As Select Without Enabling Hive 
> Support
> ---
>
> Key: SPARK-16185
> URL: https://issues.apache.org/jira/browse/SPARK-16185
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> When Hive support is not enabled, the following query generates a confusing 
> error message:
> {noformat}
>   sql("CREATE TABLE t2 USING parquet SELECT a, b from t1")
> {noformat}
> {noformat}
> unresolved operator 'CreateHiveTableAsSelectLogicalPlan CatalogTable(
>   Table: `t`
>   Created: Fri Jun 24 00:09:15 PDT 2016
>   Last Access: Wed Dec 31 15:59:59 PST 1969
>   Type: MANAGED
>   Storage(InputFormat: org.apache.hadoop.mapred.TextInputFormat, 
> OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat)), 
> false;
> org.apache.spark.sql.AnalysisException: unresolved operator 
> 'CreateHiveTableAsSelectLogicalPlan CatalogTable(
>   Table: `t`
>   Created: Fri Jun 24 00:09:15 PDT 2016
>   Last Access: Wed Dec 31 15:59:59 PST 1969
>   Type: MANAGED
>   Storage(InputFormat: org.apache.hadoop.mapred.TextInputFormat, 
> OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat)), 
> false;
> {noformat}






[jira] [Commented] (SPARK-16185) Unresolved Operator When Creating Table As Select Without Enabling Hive Support

2016-06-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15347890#comment-15347890
 ] 

Apache Spark commented on SPARK-16185:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/13886

> Unresolved Operator When Creating Table As Select Without Enabling Hive 
> Support
> ---
>
> Key: SPARK-16185
> URL: https://issues.apache.org/jira/browse/SPARK-16185
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> When Hive support is not enabled, the following query generates a confusing 
> error message:
> {noformat}
>   sql("CREATE TABLE t2 USING parquet SELECT a, b from t1")
> {noformat}
> {noformat}
> unresolved operator 'CreateHiveTableAsSelectLogicalPlan CatalogTable(
>   Table: `t`
>   Created: Fri Jun 24 00:09:15 PDT 2016
>   Last Access: Wed Dec 31 15:59:59 PST 1969
>   Type: MANAGED
>   Storage(InputFormat: org.apache.hadoop.mapred.TextInputFormat, 
> OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat)), 
> false;
> org.apache.spark.sql.AnalysisException: unresolved operator 
> 'CreateHiveTableAsSelectLogicalPlan CatalogTable(
>   Table: `t`
>   Created: Fri Jun 24 00:09:15 PDT 2016
>   Last Access: Wed Dec 31 15:59:59 PST 1969
>   Type: MANAGED
>   Storage(InputFormat: org.apache.hadoop.mapred.TextInputFormat, 
> OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat)), 
> false;
> {noformat}






[jira] [Resolved] (SPARK-16125) YarnClusterSuite test cluster mode incorrectly

2016-06-24 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-16125.
---
   Resolution: Fixed
Fix Version/s: 2.0.1

Issue resolved by pull request 13836
[https://github.com/apache/spark/pull/13836]

> YarnClusterSuite test cluster mode incorrectly
> --
>
> Key: SPARK-16125
> URL: https://issues.apache.org/jira/browse/SPARK-16125
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Reporter: Peng Zhang
>Priority: Minor
> Fix For: 2.0.1
>
>
> In YarnClusterSuite, it tests cluster mode with:
> {code}
> if (conf.get("spark.master") == "yarn-cluster") {
> {code}
> But since the SPARK-13220 change, conf.get("spark.master") returns "yarn", so 
> this *if* condition will always be *false* and the code in this block will 
> never be executed. I think it should be changed to:
> {code}
> if (conf.get("spark.submit.deployMode") == "cluster") {
> {code}
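
(For illustration, a small sketch of why the old check no longer matches after SPARK-13220: the master and the deploy mode are now reported separately.)
{code}
import org.apache.spark.SparkConf

// Illustrative only: in YARN cluster mode, spark.master is just "yarn" and the
// deploy mode lives in spark.submit.deployMode, which is what the suite should check.
val conf = new SparkConf()
  .setMaster("yarn")
  .set("spark.submit.deployMode", "cluster")

assert(conf.get("spark.master") == "yarn")               // no longer "yarn-cluster"
assert(conf.get("spark.submit.deployMode") == "cluster") // the proposed condition
{code}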






[jira] [Updated] (SPARK-16125) YarnClusterSuite test cluster mode incorrectly

2016-06-24 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-16125:
--
Assignee: Peng Zhang

> YarnClusterSuite test cluster mode incorrectly
> --
>
> Key: SPARK-16125
> URL: https://issues.apache.org/jira/browse/SPARK-16125
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Reporter: Peng Zhang
>Assignee: Peng Zhang
>Priority: Minor
> Fix For: 2.0.1
>
>
> In YarnClusterSuite, it tests cluster mode with:
> {code}
> if (conf.get("spark.master") == "yarn-cluster") {
> {code}
> But since the SPARK-13220 change, conf.get("spark.master") returns "yarn", so 
> this *if* condition will always be *false* and the code in this block will 
> never be executed. I think it should be changed to:
> {code}
> if (conf.get("spark.submit.deployMode") == "cluster") {
> {code}






[jira] [Created] (SPARK-16186) Support partition batch pruning with `IN` predicate in InMemoryTableScanExec

2016-06-24 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-16186:
-

 Summary: Support partition batch pruning with `IN` predicate in 
InMemoryTableScanExec
 Key: SPARK-16186
 URL: https://issues.apache.org/jira/browse/SPARK-16186
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Dongjoon Hyun


One of the most frequent usage patterns for Spark SQL is using **cached 
tables**.

This issue improves `InMemoryTableScanExec` to handle `IN` predicate 
efficiently by pruning partition batches.

Of course, the performance improvement varies over the queries and the 
datasets. For the following simple query, the query duration in Spark UI goes 
from 9 seconds to 50~90ms. It's about 100 times faster.
{code}
$ bin/spark-shell --driver-memory 6G
scala> val df = spark.range(20)
scala> df.createOrReplaceTempView("t")
scala> spark.catalog.cacheTable("t")
scala> sql("select id from t where id = 1").collect()// About 2 mins
scala> sql("select id from t where id = 1").collect()// less than 90ms
scala> sql("select id from t where id in (1,2,3)").collect()  // 9 seconds
scala> 
spark.conf.set("spark.sql.inMemoryColumnarStorage.partitionPruningMaxInSize", 
10)  // Enable. (Just to show this example; currently the default value is 10.)
scala> sql("select id from t where id in (1,2,3)").collect() // less than 90ms
spark.conf.set("spark.sql.inMemoryColumnarStorage.partitionPruningMaxInSize", 
0)  // Disable
scala> sql("select id from t where id in (1,2,3)").collect() // less than 90ms
{code}

This issue impacts over 35 TPC-DS queries if the tables are cached.
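
As a small aside, the timings above can be reproduced without the Spark UI with a trivial helper in the same shell session (illustrative, not part of the proposal):
{code}
// Simple wall-clock timer for the queries above (assumes the cached view `t` from the
// snippet above is still registered in this spark-shell session).
def timed[T](body: => T): T = {
  val start = System.nanoTime()
  val result = body
  println(s"took ${(System.nanoTime() - start) / 1e6} ms")
  result
}

timed { sql("select id from t where id = 1").collect() }
timed { sql("select id from t where id in (1,2,3)").collect() }
{code}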






[jira] [Updated] (SPARK-16186) Support partition batch pruning with `IN` predicate in InMemoryTableScanExec

2016-06-24 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-16186:
--
Description: 
One of the most frequent usage patterns for Spark SQL is using **cached 
tables**.

This issue improves `InMemoryTableScanExec` to handle `IN` predicate 
efficiently by pruning partition batches.

Of course, the performance improvement varies over the queries and the 
datasets. For the following simple query, the query duration in Spark UI goes 
from 9 seconds to 50~90ms. It's about 100 times faster.
{code}
$ bin/spark-shell --driver-memory 6G
scala> val df = spark.range(20)
scala> df.createOrReplaceTempView("t")
scala> spark.catalog.cacheTable("t")
scala> sql("select id from t where id = 1").collect()// About 2 mins
scala> sql("select id from t where id = 1").collect()// less than 90ms
scala> sql("select id from t where id in (1,2,3)").collect()  // 9 seconds
scala> 
spark.conf.set("spark.sql.inMemoryColumnarStorage.partitionPruningMaxInSize", 
10)  // Enable. (Just to show this example; currently the default value is 10.)
scala> sql("select id from t where id in (1,2,3)").collect() // less than 90ms
spark.conf.set("spark.sql.inMemoryColumnarStorage.partitionPruningMaxInSize", 
0)  // Disable
scala> sql("select id from t where id in (1,2,3)").collect() // 9 seconds
{code}

This issue impacts over 35 TPC-DS queries if the tables are cached.

  was:
One of the most frequent usage patterns for Spark SQL is using **cached 
tables**.

This issue improves `InMemoryTableScanExec` to handle `IN` predicate 
efficiently by pruning partition batches.

Of course, the performance improvement varies over the queries and the 
datasets. For the following simple query, the query duration in Spark UI goes 
from 9 seconds to 50~90ms. It's about 100 times faster.
{code}
$ bin/spark-shell --driver-memory 6G
scala> val df = spark.range(20)
scala> df.createOrReplaceTempView("t")
scala> spark.catalog.cacheTable("t")
scala> sql("select id from t where id = 1").collect()// About 2 mins
scala> sql("select id from t where id = 1").collect()// less than 90ms
scala> sql("select id from t where id in (1,2,3)").collect()  // 9 seconds
scala> 
spark.conf.set("spark.sql.inMemoryColumnarStorage.partitionPruningMaxInSize", 
10)  // Enable. (Just to show this example; currently the default value is 10.)
scala> sql("select id from t where id in (1,2,3)").collect() // less than 90ms
spark.conf.set("spark.sql.inMemoryColumnarStorage.partitionPruningMaxInSize", 
0)  // Disable
scala> sql("select id from t where id in (1,2,3)").collect() // less than 90ms
{code}

This issue impacts over 35 TPC-DS queries if the tables are cached.


> Support partition batch pruning with `IN` predicate in InMemoryTableScanExec
> 
>
> Key: SPARK-16186
> URL: https://issues.apache.org/jira/browse/SPARK-16186
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Dongjoon Hyun
>
> One of the most frequent usage patterns for Spark SQL is using **cached 
> tables**.
> This issue improves `InMemoryTableScanExec` to handle `IN` predicate 
> efficiently by pruning partition batches.
> Of course, the performance improvement varies over the queries and the 
> datasets. For the following simple query, the query duration in Spark UI goes 
> from 9 seconds to 50~90ms. It's about 100 times faster.
> {code}
> $ bin/spark-shell --driver-memory 6G
> scala> val df = spark.range(20)
> scala> df.createOrReplaceTempView("t")
> scala> spark.catalog.cacheTable("t")
> scala> sql("select id from t where id = 1").collect()// About 2 mins
> scala> sql("select id from t where id = 1").collect()// less than 90ms
> scala> sql("select id from t where id in (1,2,3)").collect()  // 9 seconds
> scala> 
> spark.conf.set("spark.sql.inMemoryColumnarStorage.partitionPruningMaxInSize", 
> 10)  // Enable. (Just to show this example; currently the default value is 10.)
> scala> sql("select id from t where id in (1,2,3)").collect() // less than 90ms
> spark.conf.set("spark.sql.inMemoryColumnarStorage.partitionPruningMaxInSize", 
> 0)  // Disable
> scala> sql("select id from t where id in (1,2,3)").collect() // 9 seconds
> {code}
> This issue impacts over 35 TPC-DS queries if the tables are cached.






[jira] [Commented] (SPARK-16186) Support partition batch pruning with `IN` predicate in InMemoryTableScanExec

2016-06-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15347992#comment-15347992
 ] 

Apache Spark commented on SPARK-16186:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/13887

> Support partition batch pruning with `IN` predicate in InMemoryTableScanExec
> 
>
> Key: SPARK-16186
> URL: https://issues.apache.org/jira/browse/SPARK-16186
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Dongjoon Hyun
>
> One of the most frequent usage patterns for Spark SQL is using **cached 
> tables**.
> This issue improves `InMemoryTableScanExec` to handle `IN` predicate 
> efficiently by pruning partition batches.
> Of course, the performance improvement varies over the queries and the 
> datasets. For the following simple query, the query duration in Spark UI goes 
> from 9 seconds to 50~90ms. It's about 100 times faster.
> {code}
> $ bin/spark-shell --driver-memory 6G
> scala> val df = spark.range(20)
> scala> df.createOrReplaceTempView("t")
> scala> spark.catalog.cacheTable("t")
> scala> sql("select id from t where id = 1").collect()// About 2 mins
> scala> sql("select id from t where id = 1").collect()// less than 90ms
> scala> sql("select id from t where id in (1,2,3)").collect()  // 9 seconds
> scala> 
> spark.conf.set("spark.sql.inMemoryColumnarStorage.partitionPruningMaxInSize", 
> 10)  // Enable. (Just to show this example; currently the default value is 10.)
> scala> sql("select id from t where id in (1,2,3)").collect() // less than 90ms
> spark.conf.set("spark.sql.inMemoryColumnarStorage.partitionPruningMaxInSize", 
> 0)  // Disable
> scala> sql("select id from t where id in (1,2,3)").collect() // 9 seconds
> {code}
> This issue impacts over 35 TPC-DS queries if the tables are cached.






[jira] [Assigned] (SPARK-16186) Support partition batch pruning with `IN` predicate in InMemoryTableScanExec

2016-06-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16186:


Assignee: (was: Apache Spark)

> Support partition batch pruning with `IN` predicate in InMemoryTableScanExec
> 
>
> Key: SPARK-16186
> URL: https://issues.apache.org/jira/browse/SPARK-16186
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Dongjoon Hyun
>
> One of the most frequent usage patterns for Spark SQL is using **cached 
> tables**.
> This issue improves `InMemoryTableScanExec` to handle `IN` predicate 
> efficiently by pruning partition batches.
> Of course, the performance improvement varies over the queries and the 
> datasets. For the following simple query, the query duration in Spark UI goes 
> from 9 seconds to 50~90ms. It's about 100 times faster.
> {code}
> $ bin/spark-shell --driver-memory 6G
> scala> val df = spark.range(20)
> scala> df.createOrReplaceTempView("t")
> scala> spark.catalog.cacheTable("t")
> scala> sql("select id from t where id = 1").collect()// About 2 mins
> scala> sql("select id from t where id = 1").collect()// less than 90ms
> scala> sql("select id from t where id in (1,2,3)").collect()  // 9 seconds
> scala> 
> spark.conf.set("spark.sql.inMemoryColumnarStorage.partitionPruningMaxInSize", 
> 10)  // Enable. (Just to show this example; currently the default value is 10.)
> scala> sql("select id from t where id in (1,2,3)").collect() // less than 90ms
> spark.conf.set("spark.sql.inMemoryColumnarStorage.partitionPruningMaxInSize", 
> 0)  // Disable
> scala> sql("select id from t where id in (1,2,3)").collect() // 9 seconds
> {code}
> This issue impacts over 35 TPC-DS queries if the tables are cached.






[jira] [Assigned] (SPARK-16186) Support partition batch pruning with `IN` predicate in InMemoryTableScanExec

2016-06-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16186:


Assignee: Apache Spark

> Support partition batch pruning with `IN` predicate in InMemoryTableScanExec
> 
>
> Key: SPARK-16186
> URL: https://issues.apache.org/jira/browse/SPARK-16186
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>
> One of the most frequent usage patterns for Spark SQL is using **cached 
> tables**.
> This issue improves `InMemoryTableScanExec` to handle `IN` predicate 
> efficiently by pruning partition batches.
> Of course, the performance improvement varies over the queries and the 
> datasets. For the following simple query, the query duration in Spark UI goes 
> from 9 seconds to 50~90ms. It's about 100 times faster.
> {code}
> $ bin/spark-shell --driver-memory 6G
> scala> val df = spark.range(20)
> scala> df.createOrReplaceTempView("t")
> scala> spark.catalog.cacheTable("t")
> scala> sql("select id from t where id = 1").collect()// About 2 mins
> scala> sql("select id from t where id = 1").collect()// less than 90ms
> scala> sql("select id from t where id in (1,2,3)").collect()  // 9 seconds
> scala> 
> spark.conf.set("spark.sql.inMemoryColumnarStorage.partitionPruningMaxInSize", 
> 10)  // Enable. (Just to show this example; currently the default value is 10.)
> scala> sql("select id from t where id in (1,2,3)").collect() // less than 90ms
> spark.conf.set("spark.sql.inMemoryColumnarStorage.partitionPruningMaxInSize", 
> 0)  // Disable
> scala> sql("select id from t where id in (1,2,3)").collect() // 9 seconds
> {code}
> This issue impacts over 35 TPC-DS queries if the tables are cached.






[jira] [Created] (SPARK-16187) Implement util method for ML Matrix conversion in scala/java

2016-06-24 Thread yuhao yang (JIRA)
yuhao yang created SPARK-16187:
--

 Summary: Implement util method for ML Matrix conversion in 
scala/java
 Key: SPARK-16187
 URL: https://issues.apache.org/jira/browse/SPARK-16187
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Reporter: yuhao yang


This is to provide conversion utils between old/new vector columns in a 
DataFrame. So users can use it to migrate their datasets and pipelines manually.
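
For context, a small sketch of the existing *vector* column helpers in MLUtils that the matrix utilities proposed here would presumably mirror (the DataFrame below is illustrative):
{code}
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
import org.apache.spark.mllib.util.MLUtils

// Illustrative data: an "old" mllib.linalg vector column, as produced by pre-2.0 pipelines.
val df = spark.createDataFrame(Seq(
  (0, OldVectors.dense(1.0, 2.0)),
  (1, OldVectors.dense(3.0, 4.0))
)).toDF("id", "features")

// Existing helper for vectors; this ticket proposes the analogous utilities for matrix columns.
val converted = MLUtils.convertVectorColumnsToML(df, "features")
converted.printSchema()
{code}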






[jira] [Assigned] (SPARK-16187) Implement util method for ML Matrix conversion in scala/java

2016-06-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16187:


Assignee: Apache Spark

> Implement util method for ML Matrix conversion in scala/java
> 
>
> Key: SPARK-16187
> URL: https://issues.apache.org/jira/browse/SPARK-16187
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: yuhao yang
>Assignee: Apache Spark
>
> This is to provide conversion utils between old/new vector columns in a 
> DataFrame. So users can use it to migrate their datasets and pipelines 
> manually.






[jira] [Commented] (SPARK-16187) Implement util method for ML Matrix conversion in scala/java

2016-06-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15347999#comment-15347999
 ] 

Apache Spark commented on SPARK-16187:
--

User 'hhbyyh' has created a pull request for this issue:
https://github.com/apache/spark/pull/13888

> Implement util method for ML Matrix conversion in scala/java
> 
>
> Key: SPARK-16187
> URL: https://issues.apache.org/jira/browse/SPARK-16187
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: yuhao yang
>
> This is to provide conversion utils between old/new vector columns in a 
> DataFrame. So users can use it to migrate their datasets and pipelines 
> manually.






[jira] [Assigned] (SPARK-16187) Implement util method for ML Matrix conversion in scala/java

2016-06-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16187:


Assignee: (was: Apache Spark)

> Implement util method for ML Matrix conversion in scala/java
> 
>
> Key: SPARK-16187
> URL: https://issues.apache.org/jira/browse/SPARK-16187
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: yuhao yang
>
> This is to provide conversion utils between old/new vector columns in a 
> DataFrame. So users can use it to migrate their datasets and pipelines 
> manually.






[jira] [Created] (SPARK-16188) Spark sql will create a lot of small files

2016-06-24 Thread cen yuhai (JIRA)
cen yuhai created SPARK-16188:
-

 Summary: Spark sql will create a lot of small files
 Key: SPARK-16188
 URL: https://issues.apache.org/jira/browse/SPARK-16188
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.0, 2.0.0
 Environment: spark 1.6.1
Reporter: cen yuhai


I find that Spark SQL creates as many files as there are partitions. When the 
results are small, there are too many small files and most of them are empty.

Hive has a feature to detect the average file size: if the average file size is 
smaller than "hive.merge.smallfiles.avgsize", Hive adds a job to merge the files.
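
For reference, a common user-side mitigation today (illustrative only, not what this ticket proposes) is to reduce the number of output partitions when the result is known to be small:
{code}
// Illustrative workaround: collapse a small result to a few partitions before writing,
// so the job emits a handful of files instead of one (often empty) file per partition.
val result = sqlContext.range(0, 100).toDF("id")   // stand-in for a small query result
result.coalesce(1)
  .write
  .mode("overwrite")
  .parquet("/tmp/small_result")                    // hypothetical output path
{code}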






[jira] [Updated] (SPARK-16188) Spark sql create a lot of small files

2016-06-24 Thread cen yuhai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

cen yuhai updated SPARK-16188:
--
Summary: Spark sql create a lot of small files  (was: Spark sql will create 
a lot of small files)

> Spark sql create a lot of small files
> -
>
> Key: SPARK-16188
> URL: https://issues.apache.org/jira/browse/SPARK-16188
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0
> Environment: spark 1.6.1
>Reporter: cen yuhai
>
> I find that Spark SQL creates as many files as there are partitions. When the 
> results are small, there are too many small files and most of them are empty.
> Hive has a feature to detect the average file size: if the average file size is 
> smaller than "hive.merge.smallfiles.avgsize", Hive adds a job to merge the files.






[jira] [Updated] (SPARK-16186) Support partition batch pruning with `IN` predicate in InMemoryTableScanExec

2016-06-24 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-16186:
--
Description: 
One of the most frequent usage patterns for Spark SQL is using **cached 
tables**.

This issue improves `InMemoryTableScanExec` to handle `IN` predicate 
efficiently by pruning partition batches.

Of course, the performance improvement varies over the queries and the 
datasets. For the following simple query, the query duration in Spark UI goes 
from 9 seconds to 50~90ms. It's over 100 times faster.
{code}
$ bin/spark-shell --driver-memory 6G
scala> val df = spark.range(20)
scala> df.createOrReplaceTempView("t")
scala> spark.catalog.cacheTable("t")
scala> sql("select id from t where id = 1").collect()// About 2 mins
scala> sql("select id from t where id = 1").collect()// less than 90ms
scala> sql("select id from t where id in (1,2,3)").collect()  // 9 seconds
scala> 
spark.conf.set("spark.sql.inMemoryColumnarStorage.partitionPruningMaxInSize", 
10)  // Enable. (Just to show this example; currently the default value is 10.)
scala> sql("select id from t where id in (1,2,3)").collect() // less than 90ms
scala> 
spark.conf.set("spark.sql.inMemoryColumnarStorage.partitionPruningMaxInSize", 
0)  // Disable
scala> sql("select id from t where id in (1,2,3)").collect() // 9 seconds
{code}

This issue impacts over 35 TPC-DS queries if the tables are cached.

  was:
One of the most frequent usage patterns for Spark SQL is using **cached 
tables**.

This issue improves `InMemoryTableScanExec` to handle `IN` predicate 
efficiently by pruning partition batches.

Of course, the performance improvement varies over the queries and the 
datasets. For the following simple query, the query duration in Spark UI goes 
from 9 seconds to 50~90ms. It's about 100 times faster.
{code}
$ bin/spark-shell --driver-memory 6G
scala> val df = spark.range(20)
scala> df.createOrReplaceTempView("t")
scala> spark.catalog.cacheTable("t")
scala> sql("select id from t where id = 1").collect()// About 2 mins
scala> sql("select id from t where id = 1").collect()// less than 90ms
scala> sql("select id from t where id in (1,2,3)").collect()  // 9 seconds
scala> 
spark.conf.set("spark.sql.inMemoryColumnarStorage.partitionPruningMaxInSize", 
10)  // Enable. (Just to show this example; currently the default value is 10.)
scala> sql("select id from t where id in (1,2,3)").collect() // less than 90ms
spark.conf.set("spark.sql.inMemoryColumnarStorage.partitionPruningMaxInSize", 
0)  // Disable
scala> sql("select id from t where id in (1,2,3)").collect() // 9 seconds
{code}

This issue impacts over 35 TPC-DS queries if the tables are cached.


> Support partition batch pruning with `IN` predicate in InMemoryTableScanExec
> 
>
> Key: SPARK-16186
> URL: https://issues.apache.org/jira/browse/SPARK-16186
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Dongjoon Hyun
>
> One of the most frequent usage patterns for Spark SQL is using **cached 
> tables**.
> This issue improves `InMemoryTableScanExec` to handle `IN` predicate 
> efficiently by pruning partition batches.
> Of course, the performance improvement varies over the queries and the 
> datasets. For the following simple query, the query duration in Spark UI goes 
> from 9 seconds to 50~90ms. It's over 100 times faster.
> {code}
> $ bin/spark-shell --driver-memory 6G
> scala> val df = spark.range(20)
> scala> df.createOrReplaceTempView("t")
> scala> spark.catalog.cacheTable("t")
> scala> sql("select id from t where id = 1").collect()// About 2 mins
> scala> sql("select id from t where id = 1").collect()// less than 90ms
> scala> sql("select id from t where id in (1,2,3)").collect()  // 9 seconds
> scala> 
> spark.conf.set("spark.sql.inMemoryColumnarStorage.partitionPruningMaxInSize", 
> 10)  // Enable. (Just to show this example; currently the default value is 10.)
> scala> sql("select id from t where id in (1,2,3)").collect() // less than 90ms
> scala> 
> spark.conf.set("spark.sql.inMemoryColumnarStorage.partitionPruningMaxInSize", 
> 0)  // Disable
> scala> sql("select id from t where id in (1,2,3)").collect() // 9 seconds
> {code}
> This issue impacts over 35 TPC-DS queries if the tables are cached.






[jira] [Updated] (SPARK-16188) Spark sql create a lot of small files

2016-06-24 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-16188:
--
  Priority: Minor  (was: Major)
Issue Type: Improvement  (was: Bug)

> Spark sql create a lot of small files
> -
>
> Key: SPARK-16188
> URL: https://issues.apache.org/jira/browse/SPARK-16188
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0
> Environment: spark 1.6.1
>Reporter: cen yuhai
>Priority: Minor
>
> I find that Spark SQL creates as many files as there are partitions. When the 
> results are small, there are too many small files and most of them are empty.
> Hive has a feature to detect the average file size: if the average file size is 
> smaller than "hive.merge.smallfiles.avgsize", Hive adds a job to merge the files.






[jira] [Commented] (SPARK-16176) model loading backward compatibility for ml.recommendation

2016-06-24 Thread li taoran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15348015#comment-15348015
 ] 

li taoran commented on SPARK-16176:
---

I have checked that the current ALS can load an ALS model saved by Apache Spark 1.6.
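
(For reference, a minimal check along these lines; the model path is hypothetical.)
{code}
import org.apache.spark.ml.recommendation.ALSModel

// Load a model directory written by ml.recommendation.ALS under Spark 1.6
// with the current reader; the path below is a placeholder.
val model = ALSModel.load("/tmp/als-model-saved-by-1.6")
println(s"rank = ${model.rank}")
model.userFactors.show(5)
{code}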

> model loading backward compatibility for ml.recommendation
> --
>
> Key: SPARK-16176
> URL: https://issues.apache.org/jira/browse/SPARK-16176
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: yuhao yang
>Priority: Minor
>
> Check if current ALS can load the models saved by Apache Spark 1.6. If not, 
> we need a fix.






[jira] [Commented] (SPARK-16169) Saving Intermediate dataframe increasing processing time upto 5 times.

2016-06-24 Thread Manish Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15348043#comment-15348043
 ] 

Manish Kumar commented on SPARK-16169:
--

Even if our code were asking Spark to do more work, some tasks should be in running 
status; but all the tasks and jobs get completed within the first 10 minutes, while 
the application keeps running for around 50 minutes. That is what is shown in the 
attached Spark UI screenshot.

In the attached Spark UI you can see that all the jobs are in completed status, 
which happens within the first 10 minutes of execution, yet the total execution 
time of the Spark application is 50 minutes.

> Saving Intermediate dataframe increasing processing time upto 5 times.
> --
>
> Key: SPARK-16169
> URL: https://issues.apache.org/jira/browse/SPARK-16169
> Project: Spark
>  Issue Type: Question
>  Components: Spark Submit, Web UI
>Affects Versions: 1.6.1
> Environment: Amazon EMR
>Reporter: Manish Kumar
>  Labels: performance
> Attachments: Spark-UI.png
>
>
> When a Spark application (written in Scala) tries to save an intermediate 
> dataframe, the processing time increases almost 5 times. 
> Although the Spark UI clearly shows that all jobs are completed, the Spark 
> application remains in running status.
> Below is the command for saving the intermediate output and then using the 
> dataframe.
> {noformat}
> saveDataFrame(flushPath, flushFormat, isCoalesce, flushMode, 
> previousDataFrame, sqlContext)
> previousDataFrame.count
> {noformat}
> Here, previousDataFrame is the result of the previous step and saveDataFrame 
> just saves the DataFrame to the given location; previousDataFrame is then used 
> by the next steps/transformations. 
> Below is the Spark UI screenshot, which shows jobs as completed although some 
> tasks inside them are neither completed nor skipped.
> !Spark-UI.png!






[jira] [Comment Edited] (SPARK-16169) Saving Intermediate dataframe increasing processing time upto 5 times.

2016-06-24 Thread Manish Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15348043#comment-15348043
 ] 

Manish Kumar edited comment on SPARK-16169 at 6/24/16 9:14 AM:
---

Even if our code were asking Spark to do more work, some tasks should be in running 
status; but all the tasks and jobs get completed within the first 10 minutes, while 
the application keeps running for around 50 minutes.

In the attached Spark UI you can see that all the jobs are in completed status, 
which happens within the first 10 minutes of execution, yet the total execution 
time of the Spark application is 50 minutes.


was (Author: mkbond777):
Even if our code were asking Spark to do more work, some tasks should be in running 
status; but all the tasks and jobs get completed within the first 10 minutes, while 
the application keeps running for around 50 minutes. That is what is shown in the 
attached Spark UI screenshot.

In the attached Spark UI you can see that all the jobs are in completed status, 
which happens within the first 10 minutes of execution, yet the total execution 
time of the Spark application is 50 minutes.

> Saving Intermediate dataframe increasing processing time upto 5 times.
> --
>
> Key: SPARK-16169
> URL: https://issues.apache.org/jira/browse/SPARK-16169
> Project: Spark
>  Issue Type: Question
>  Components: Spark Submit, Web UI
>Affects Versions: 1.6.1
> Environment: Amazon EMR
>Reporter: Manish Kumar
>  Labels: performance
> Attachments: Spark-UI.png
>
>
> When a Spark application (written in Scala) tries to save an intermediate 
> dataframe, the processing time increases almost 5 times. 
> Although the Spark UI clearly shows that all jobs are completed, the Spark 
> application remains in running status.
> Below is the command for saving the intermediate output and then using the 
> dataframe.
> {noformat}
> saveDataFrame(flushPath, flushFormat, isCoalesce, flushMode, 
> previousDataFrame, sqlContext)
> previousDataFrame.count
> {noformat}
> Here, previousDataFrame is the result of the previous step and saveDataFrame 
> just saves the DataFrame to the given location; previousDataFrame is then used 
> by the next steps/transformations. 
> Below is the Spark UI screenshot, which shows jobs as completed although some 
> tasks inside them are neither completed nor skipped.
> !Spark-UI.png!






[jira] [Commented] (SPARK-16176) model loading backward compatibility for ml.recommendation

2016-06-24 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15348047#comment-15348047
 ] 

yuhao yang commented on SPARK-16176:


Thanks. I've verified that too. Closing the issue.

> model loading backward compatibility for ml.recommendation
> --
>
> Key: SPARK-16176
> URL: https://issues.apache.org/jira/browse/SPARK-16176
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: yuhao yang
>Priority: Minor
>
> Check if current ALS can load the models saved by Apache Spark 1.6. If not, 
> we need a fix.






[jira] [Closed] (SPARK-16176) model loading backward compatibility for ml.recommendation

2016-06-24 Thread yuhao yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuhao yang closed SPARK-16176.
--
Resolution: Not A Problem

> model loading backward compatibility for ml.recommendation
> --
>
> Key: SPARK-16176
> URL: https://issues.apache.org/jira/browse/SPARK-16176
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: yuhao yang
>Priority: Minor
>
> Check if current ALS can load the models saved by Apache Spark 1.6. If not, 
> we need a fix.






[jira] [Closed] (SPARK-15980) Add PushPredicateThroughObjectConsumer rule to Optimizer.

2016-06-24 Thread Takuya Ueshin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin closed SPARK-15980.
-
Resolution: Duplicate

> Add PushPredicateThroughObjectConsumer rule to Optimizer.
> -
>
> Key: SPARK-15980
> URL: https://issues.apache.org/jira/browse/SPARK-15980
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Takuya Ueshin
>
> I added {{PushPredicateThroughObjectConsumer}} rule to push-down predicates 
> through {{ObjectConsumer}}.
> And as an example, I implemented push-down typed filter through 
> {{SerializeFromObject}}.






[jira] [Commented] (SPARK-16169) Saving Intermediate dataframe increasing processing time upto 5 times.

2016-06-24 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15348063#comment-15348063
 ] 

Sean Owen commented on SPARK-16169:
---

You may have, for example, significantly skewed data causing one task to take 
much longer than the others. There are simply a lot of reasons for this 
behavior, and without more specific info, I don't know that this is actionable.
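
(For illustration, one quick way to check for that kind of skew; the column name below is hypothetical and previousDataFrame is the DataFrame from the ticket.)
{code}
import org.apache.spark.sql.functions._

// Heavily skewed keys show up as a few counts that dwarf the rest, which makes a
// handful of tasks run far longer than the others.
previousDataFrame
  .groupBy("key")            // hypothetical join/partition key
  .count()
  .orderBy(desc("count"))
  .show(20)
{code}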

> Saving Intermediate dataframe increasing processing time upto 5 times.
> --
>
> Key: SPARK-16169
> URL: https://issues.apache.org/jira/browse/SPARK-16169
> Project: Spark
>  Issue Type: Question
>  Components: Spark Submit, Web UI
>Affects Versions: 1.6.1
> Environment: Amazon EMR
>Reporter: Manish Kumar
>  Labels: performance
> Attachments: Spark-UI.png
>
>
> When a Spark application (written in Scala) tries to save an intermediate 
> dataframe, the processing time increases almost 5 times. 
> Although the Spark UI clearly shows that all jobs are completed, the Spark 
> application remains in running status.
> Below is the command for saving the intermediate output and then using the 
> dataframe.
> {noformat}
> saveDataFrame(flushPath, flushFormat, isCoalesce, flushMode, 
> previousDataFrame, sqlContext)
> previousDataFrame.count
> {noformat}
> Here, previousDataFrame is the result of the previous step and saveDataFrame 
> just saves the DataFrame to the given location; previousDataFrame is then used 
> by the next steps/transformations. 
> Below is the Spark UI screenshot, which shows jobs as completed although some 
> tasks inside them are neither completed nor skipped.
> !Spark-UI.png!






[jira] [Created] (SPARK-16189) Add ExistingRDD logical plan for input with RDD to have a chance to eliminate serialize/deserialize.

2016-06-24 Thread Takuya Ueshin (JIRA)
Takuya Ueshin created SPARK-16189:
-

 Summary: Add ExistingRDD logical plan for input with RDD to have a 
chance to eliminate serialize/deserialize.
 Key: SPARK-16189
 URL: https://issues.apache.org/jira/browse/SPARK-16189
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Takuya Ueshin


Currently the input {{RDD}} of a {{Dataset}} is always serialized to 
{{RDD\[InternalRow\]}} before being exposed as a {{Dataset}}, but there are cases 
where we use {{map}} or {{mapPartitions}} right after converting to {{Dataset}}.
In such cases a serialize followed by a deserialize happens even though it is not needed.

This patch adds an {{ExistingRDD}} logical plan for {{RDD}} input to get a 
chance to eliminate the serialize/deserialize.
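
For illustration, a minimal sketch of the pattern in question (assumed spark-shell session; not code from the patch):
{code}
import spark.implicits._

case class Record(id: Long, name: String)

val rdd = spark.sparkContext.parallelize(Seq(Record(1L, "a"), Record(2L, "b")))

// Converting the RDD serializes Record -> InternalRow ...
val ds = spark.createDataset(rdd)

// ... and the typed map right afterwards deserializes straight back to Record,
// which is the round trip this proposal gives the optimizer a chance to eliminate.
val mapped = ds.map(r => r.copy(id = r.id + 1))
mapped.show()
{code}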






[jira] [Resolved] (SPARK-16129) Eliminate direct use of commons-lang classes in favor of commons-lang3

2016-06-24 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-16129.
---
   Resolution: Fixed
Fix Version/s: 2.0.1

Issue resolved by pull request 13843
[https://github.com/apache/spark/pull/13843]

> Eliminate direct use of commons-lang classes in favor of commons-lang3
> --
>
> Key: SPARK-16129
> URL: https://issues.apache.org/jira/browse/SPARK-16129
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL, Streaming
>Affects Versions: 2.0.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 2.0.1
>
>
> There are several instances where we are still using org.apache.commons.lang 
> classes instead of org.apache.commons.lang3. Only the latter is a direct 
> dependency.
> That's easy enough to fix. I think it might be important for 2.0.0 even at 
> this late hour, because in addition, Commons Lang's NotImplementedException 
> is being used where I believe the JDK's standard 
> UnsupportedOperationException is appropriate. Since these exceptions may 
> conceivably propagate and be handled by user code (?) they sorta become part 
> of an API.
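
(For illustration only, a sketch of the exception swap described above; not a diff from the actual pull request.)
{code}
// Before (relies on the transitive commons-lang 2.x dependency):
//   throw new org.apache.commons.lang.NotImplementedException("not implemented")
// After (JDK standard exception, no extra dependency):
object Unsupported {
  def feature(name: String): Nothing =
    throw new UnsupportedOperationException(s"$name is not supported")
}
{code}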






[jira] [Commented] (SPARK-16189) Add ExistingRDD logical plan for input with RDD to have a chance to eliminate serialize/deserialize.

2016-06-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15348071#comment-15348071
 ] 

Apache Spark commented on SPARK-16189:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/13890

> Add ExistingRDD logical plan for input with RDD to have a chance to eliminate 
> serialize/deserialize.
> 
>
> Key: SPARK-16189
> URL: https://issues.apache.org/jira/browse/SPARK-16189
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Takuya Ueshin
>
> Currently the input {{RDD}} of a {{Dataset}} is always serialized to 
> {{RDD\[InternalRow\]}} before being exposed as a {{Dataset}}, but there are cases 
> where we use {{map}} or {{mapPartitions}} right after converting to {{Dataset}}.
> In such cases a serialize followed by a deserialize happens even though it is 
> not needed.
> This patch adds an {{ExistingRDD}} logical plan for {{RDD}} input to get a 
> chance to eliminate the serialize/deserialize.






[jira] [Assigned] (SPARK-16189) Add ExistingRDD logical plan for input with RDD to have a chance to eliminate serialize/deserialize.

2016-06-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16189:


Assignee: (was: Apache Spark)

> Add ExistingRDD logical plan for input with RDD to have a chance to eliminate 
> serialize/deserialize.
> 
>
> Key: SPARK-16189
> URL: https://issues.apache.org/jira/browse/SPARK-16189
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Takuya Ueshin
>
> Currently the input {{RDD}} of a {{Dataset}} is always serialized to 
> {{RDD\[InternalRow\]}} before being exposed as a {{Dataset}}, but there are cases 
> where we use {{map}} or {{mapPartitions}} right after converting to {{Dataset}}.
> In such cases a serialize followed by a deserialize happens even though it is 
> not needed.
> This patch adds an {{ExistingRDD}} logical plan for {{RDD}} input to get a 
> chance to eliminate the serialize/deserialize.






[jira] [Assigned] (SPARK-16189) Add ExistingRDD logical plan for input with RDD to have a chance to eliminate serialize/deserialize.

2016-06-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16189:


Assignee: Apache Spark

> Add ExistingRDD logical plan for input with RDD to have a chance to eliminate 
> serialize/deserialize.
> 
>
> Key: SPARK-16189
> URL: https://issues.apache.org/jira/browse/SPARK-16189
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Takuya Ueshin
>Assignee: Apache Spark
>
> Currently the input {{RDD}} of a {{Dataset}} is always serialized to 
> {{RDD\[InternalRow\]}} before being exposed as a {{Dataset}}, but there are cases 
> where we use {{map}} or {{mapPartitions}} right after converting to {{Dataset}}.
> In such cases a serialize followed by a deserialize happens even though it is 
> not needed.
> This patch adds an {{ExistingRDD}} logical plan for {{RDD}} input to get a 
> chance to eliminate the serialize/deserialize.






[jira] [Updated] (SPARK-16188) Spark sql create a lot of small files

2016-06-24 Thread cen yuhai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

cen yuhai updated SPARK-16188:
--
Affects Version/s: (was: 2.0.0)

> Spark sql create a lot of small files
> -
>
> Key: SPARK-16188
> URL: https://issues.apache.org/jira/browse/SPARK-16188
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: spark 1.6.1
>Reporter: cen yuhai
>Priority: Minor
>
> I find that Spark SQL creates as many files as there are partitions. When the 
> results are small, there are too many small files and most of them are empty.
> Hive has a feature to detect the average file size: if the average file size is 
> smaller than "hive.merge.smallfiles.avgsize", Hive adds a job to merge the files.






[jira] [Created] (SPARK-16190) Worker registration failed: Duplicate worker ID

2016-06-24 Thread Thomas Huang (JIRA)
Thomas Huang created SPARK-16190:


 Summary: Worker registration failed: Duplicate worker ID
 Key: SPARK-16190
 URL: https://issues.apache.org/jira/browse/SPARK-16190
 Project: Spark
  Issue Type: Bug
  Components: Scheduler
Affects Versions: 1.6.1
Reporter: Thomas Huang
Priority: Critical


Several workers crashed simultaneously due to this error:
Worker registration failed: Duplicate worker ID

This is the worker log on one of those crashed workers:
16/06/24 16:28:53 INFO ExecutorRunner: Killing process!
16/06/24 16:28:53 INFO ExecutorRunner: Runner thread for executor 
app-20160624003013-0442/26 interrupted
16/06/24 16:28:53 INFO ExecutorRunner: Killing process!
16/06/24 16:29:03 WARN ExecutorRunner: Failed to terminate process: 
java.lang.UNIXProcess@31340137. This process will likely be orphaned.
16/06/24 16:29:03 WARN ExecutorRunner: Failed to terminate process: 
java.lang.UNIXProcess@4d3bdb1d. This process will likely be orphaned.
16/06/24 16:29:03 INFO Worker: Executor app-20160624003013-0442/8 finished with 
state KILLED
16/06/24 16:29:03 INFO Worker: Executor app-20160624003013-0442/26 finished 
with state KILLED
16/06/24 16:29:03 INFO Worker: Cleaning up local directories for application 
app-20160624003013-0442
16/06/24 16:31:18 INFO ExternalShuffleBlockResolver: Application 
app-20160624003013-0442 removed, cleanupLocalDirs = true
16/06/24 16:31:18 INFO Worker: Asked to launch executor 
app-20160624162905-0469/14 for SparkStreamingLRScala
16/06/24 16:31:18 INFO SecurityManager: Changing view acls to: mqq
16/06/24 16:31:18 INFO SecurityManager: Changing modify acls to: mqq
16/06/24 16:31:18 INFO SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(mqq); users with 
modify permissions: Set(mqq)
16/06/24 16:31:18 INFO ExecutorRunner: Launch command: 
"/data/jdk1.7.0_60/bin/java" "-cp" 
"/data/spark-1.6.1-bin-cdh4/conf/:/data/spark-1.6.1-bin-cdh4/lib/spark-assembly-1.6.1-hadoop2.3.0.jar:/data/spark-1.6.1-bin-cdh4/lib/datanucleus-core-3.2.10.jar:/data/spark-1.6.1-bin-cdh4/lib/datanucleus-api-jdo-3.2.6.jar:/data/spark-1.6.1-bin-cdh4/lib/datanucleus-rdbms-3.2.9.jar"
 "-Xms10240M" "-Xmx10240M" "-Dspark.driver.port=34792" "-XX:MaxPermSize=256m" 
"org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" 
"spark://CoarseGrainedScheduler@100.65.21.199:34792" "--executor-id" "14" 
"--hostname" "100.65.21.223" "--cores" "5" "--app-id" "app-20160624162905-0469" 
"--worker-url" "spark://Worker@100.65.21.223:46581"
16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 
requested this worker to reconnect.
16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 
requested this worker to reconnect.
16/06/24 16:31:18 INFO Worker: Connecting to master 100.65.21.199:7077...
16/06/24 16:31:18 INFO Worker: Successfully registered with master 
spark://100.65.21.199:7077
16/06/24 16:31:18 INFO Worker: Worker cleanup enabled; old application 
directories will be deleted in: /data/spark-1.6.1-bin-cdh4/work
16/06/24 16:31:18 INFO Worker: Not spawning another attempt to register with 
the master, since there is an attempt scheduled already.
16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 
requested this worker to reconnect.
16/06/24 16:31:18 INFO Worker: Connecting to master 100.65.21.199:7077...
16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 
requested this worker to reconnect.
16/06/24 16:31:18 INFO Worker: Not spawning another attempt to register with 
the master, since there is an attempt scheduled already.
16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 
requested this worker to reconnect.
16/06/24 16:31:18 INFO Worker: Not spawning another attempt to register with 
the master, since there is an attempt scheduled already.
16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 
requested this worker to reconnect.
16/06/24 16:31:18 INFO Worker: Not spawning another attempt to register with 
the master, since there is an attempt scheduled already.
16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 
requested this worker to reconnect.
16/06/24 16:31:18 INFO Worker: Not spawning another attempt to register with 
the master, since there is an attempt scheduled already.
16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 
requested this worker to reconnect.
16/06/24 16:31:18 INFO Worker: Not spawning another attempt to register with 
the master, since there is an attempt scheduled already.
16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 
requested this worker to reconnect.
16/06/24 16:31:18 INFO Worker: Not spawning another attempt to register with 
the master, since there is an attempt scheduled already.
16/06/24 16:

[jira] [Updated] (SPARK-16190) Worker registration failed: Duplicate worker ID

2016-06-24 Thread Thomas Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Huang updated SPARK-16190:
-
Attachment: spark-mqq-org.apache.spark.deploy.worker.Worker-1-slave7.out

worker log of slave7

> Worker registration failed: Duplicate worker ID
> ---
>
> Key: SPARK-16190
> URL: https://issues.apache.org/jira/browse/SPARK-16190
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.6.1
>Reporter: Thomas Huang
>Priority: Critical
> Attachments: 
> spark-mqq-org.apache.spark.deploy.worker.Worker-1-slave7.out
>
>
> Several workers crashed simultaneously due to this error:
> Worker registration failed: Duplicate worker ID
> This is the worker log on one of those crashed workers:
> 16/06/24 16:28:53 INFO ExecutorRunner: Killing process!
> 16/06/24 16:28:53 INFO ExecutorRunner: Runner thread for executor 
> app-20160624003013-0442/26 interrupted
> 16/06/24 16:28:53 INFO ExecutorRunner: Killing process!
> 16/06/24 16:29:03 WARN ExecutorRunner: Failed to terminate process: 
> java.lang.UNIXProcess@31340137. This process will likely be orphaned.
> 16/06/24 16:29:03 WARN ExecutorRunner: Failed to terminate process: 
> java.lang.UNIXProcess@4d3bdb1d. This process will likely be orphaned.
> 16/06/24 16:29:03 INFO Worker: Executor app-20160624003013-0442/8 finished 
> with state KILLED
> 16/06/24 16:29:03 INFO Worker: Executor app-20160624003013-0442/26 finished 
> with state KILLED
> 16/06/24 16:29:03 INFO Worker: Cleaning up local directories for application 
> app-20160624003013-0442
> 16/06/24 16:31:18 INFO ExternalShuffleBlockResolver: Application 
> app-20160624003013-0442 removed, cleanupLocalDirs = true
> 16/06/24 16:31:18 INFO Worker: Asked to launch executor 
> app-20160624162905-0469/14 for SparkStreamingLRScala
> 16/06/24 16:31:18 INFO SecurityManager: Changing view acls to: mqq
> 16/06/24 16:31:18 INFO SecurityManager: Changing modify acls to: mqq
> 16/06/24 16:31:18 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(mqq); users with 
> modify permissions: Set(mqq)
> 16/06/24 16:31:18 INFO ExecutorRunner: Launch command: 
> "/data/jdk1.7.0_60/bin/java" "-cp" 
> "/data/spark-1.6.1-bin-cdh4/conf/:/data/spark-1.6.1-bin-cdh4/lib/spark-assembly-1.6.1-hadoop2.3.0.jar:/data/spark-1.6.1-bin-cdh4/lib/datanucleus-core-3.2.10.jar:/data/spark-1.6.1-bin-cdh4/lib/datanucleus-api-jdo-3.2.6.jar:/data/spark-1.6.1-bin-cdh4/lib/datanucleus-rdbms-3.2.9.jar"
>  "-Xms10240M" "-Xmx10240M" "-Dspark.driver.port=34792" "-XX:MaxPermSize=256m" 
> "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" 
> "spark://CoarseGrainedScheduler@100.65.21.199:34792" "--executor-id" "14" 
> "--hostname" "100.65.21.223" "--cores" "5" "--app-id" 
> "app-20160624162905-0469" "--worker-url" "spark://Worker@100.65.21.223:46581"
> 16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 
> requested this worker to reconnect.
> 16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 
> requested this worker to reconnect.
> 16/06/24 16:31:18 INFO Worker: Connecting to master 100.65.21.199:7077...
> 16/06/24 16:31:18 INFO Worker: Successfully registered with master 
> spark://100.65.21.199:7077
> 16/06/24 16:31:18 INFO Worker: Worker cleanup enabled; old application 
> directories will be deleted in: /data/spark-1.6.1-bin-cdh4/work
> 16/06/24 16:31:18 INFO Worker: Not spawning another attempt to register with 
> the master, since there is an attempt scheduled already.
> 16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 
> requested this worker to reconnect.
> 16/06/24 16:31:18 INFO Worker: Connecting to master 100.65.21.199:7077...
> 16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 
> requested this worker to reconnect.
> 16/06/24 16:31:18 INFO Worker: Not spawning another attempt to register with 
> the master, since there is an attempt scheduled already.
> 16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 
> requested this worker to reconnect.
> 16/06/24 16:31:18 INFO Worker: Not spawning another attempt to register with 
> the master, since there is an attempt scheduled already.
> 16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 
> requested this worker to reconnect.
> 16/06/24 16:31:18 INFO Worker: Not spawning another attempt to register with 
> the master, since there is an attempt scheduled already.
> 16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 
> requested this worker to reconnect.
> 16/06/24 16:31:18 INFO Worker: Not spawning another attempt to register with 
> the master, since there is an attempt scheduled already.
> 16/06/24 16:31:18 INF

[jira] [Updated] (SPARK-16190) Worker registration failed: Duplicate worker ID

2016-06-24 Thread Thomas Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Huang updated SPARK-16190:
-
Attachment: spark-mqq-org.apache.spark.deploy.worker.Worker-1-slave8.out

worker log of slave 8

> Worker registration failed: Duplicate worker ID
> ---
>
> Key: SPARK-16190
> URL: https://issues.apache.org/jira/browse/SPARK-16190
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.6.1
>Reporter: Thomas Huang
>Priority: Critical
> Attachments: 
> spark-mqq-org.apache.spark.deploy.worker.Worker-1-slave7.out, 
> spark-mqq-org.apache.spark.deploy.worker.Worker-1-slave8.out
>
>
> Several workers crashed simultaneously due to this error:
> Worker registration failed: Duplicate worker ID
> This is the worker log on one of those crashed workers:
> 16/06/24 16:28:53 INFO ExecutorRunner: Killing process!
> 16/06/24 16:28:53 INFO ExecutorRunner: Runner thread for executor 
> app-20160624003013-0442/26 interrupted
> 16/06/24 16:28:53 INFO ExecutorRunner: Killing process!
> 16/06/24 16:29:03 WARN ExecutorRunner: Failed to terminate process: 
> java.lang.UNIXProcess@31340137. This process will likely be orphaned.
> 16/06/24 16:29:03 WARN ExecutorRunner: Failed to terminate process: 
> java.lang.UNIXProcess@4d3bdb1d. This process will likely be orphaned.
> 16/06/24 16:29:03 INFO Worker: Executor app-20160624003013-0442/8 finished 
> with state KILLED
> 16/06/24 16:29:03 INFO Worker: Executor app-20160624003013-0442/26 finished 
> with state KILLED
> 16/06/24 16:29:03 INFO Worker: Cleaning up local directories for application 
> app-20160624003013-0442
> 16/06/24 16:31:18 INFO ExternalShuffleBlockResolver: Application 
> app-20160624003013-0442 removed, cleanupLocalDirs = true
> 16/06/24 16:31:18 INFO Worker: Asked to launch executor 
> app-20160624162905-0469/14 for SparkStreamingLRScala
> 16/06/24 16:31:18 INFO SecurityManager: Changing view acls to: mqq
> 16/06/24 16:31:18 INFO SecurityManager: Changing modify acls to: mqq
> 16/06/24 16:31:18 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(mqq); users with 
> modify permissions: Set(mqq)
> 16/06/24 16:31:18 INFO ExecutorRunner: Launch command: 
> "/data/jdk1.7.0_60/bin/java" "-cp" 
> "/data/spark-1.6.1-bin-cdh4/conf/:/data/spark-1.6.1-bin-cdh4/lib/spark-assembly-1.6.1-hadoop2.3.0.jar:/data/spark-1.6.1-bin-cdh4/lib/datanucleus-core-3.2.10.jar:/data/spark-1.6.1-bin-cdh4/lib/datanucleus-api-jdo-3.2.6.jar:/data/spark-1.6.1-bin-cdh4/lib/datanucleus-rdbms-3.2.9.jar"
>  "-Xms10240M" "-Xmx10240M" "-Dspark.driver.port=34792" "-XX:MaxPermSize=256m" 
> "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" 
> "spark://CoarseGrainedScheduler@100.65.21.199:34792" "--executor-id" "14" 
> "--hostname" "100.65.21.223" "--cores" "5" "--app-id" 
> "app-20160624162905-0469" "--worker-url" "spark://Worker@100.65.21.223:46581"
> 16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 
> requested this worker to reconnect.
> 16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 
> requested this worker to reconnect.
> 16/06/24 16:31:18 INFO Worker: Connecting to master 100.65.21.199:7077...
> 16/06/24 16:31:18 INFO Worker: Successfully registered with master 
> spark://100.65.21.199:7077
> 16/06/24 16:31:18 INFO Worker: Worker cleanup enabled; old application 
> directories will be deleted in: /data/spark-1.6.1-bin-cdh4/work
> 16/06/24 16:31:18 INFO Worker: Not spawning another attempt to register with 
> the master, since there is an attempt scheduled already.
> 16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 
> requested this worker to reconnect.
> 16/06/24 16:31:18 INFO Worker: Connecting to master 100.65.21.199:7077...
> 16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 
> requested this worker to reconnect.
> 16/06/24 16:31:18 INFO Worker: Not spawning another attempt to register with 
> the master, since there is an attempt scheduled already.
> 16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 
> requested this worker to reconnect.
> 16/06/24 16:31:18 INFO Worker: Not spawning another attempt to register with 
> the master, since there is an attempt scheduled already.
> 16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 
> requested this worker to reconnect.
> 16/06/24 16:31:18 INFO Worker: Not spawning another attempt to register with 
> the master, since there is an attempt scheduled already.
> 16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 
> requested this worker to reconnect.
> 16/06/24 16:31:18 INFO Worker: Not spawning another attempt to register with 
> the master, si

[jira] [Updated] (SPARK-16190) Worker registration failed: Duplicate worker ID

2016-06-24 Thread Thomas Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Huang updated SPARK-16190:
-
Attachment: spark-mqq-org.apache.spark.deploy.worker.Worker-1-slave2.out

worker log of slave 2

> Worker registration failed: Duplicate worker ID
> ---
>
> Key: SPARK-16190
> URL: https://issues.apache.org/jira/browse/SPARK-16190
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.6.1
>Reporter: Thomas Huang
>Priority: Critical
> Attachments: 
> spark-mqq-org.apache.spark.deploy.worker.Worker-1-slave19.out, 
> spark-mqq-org.apache.spark.deploy.worker.Worker-1-slave2.out, 
> spark-mqq-org.apache.spark.deploy.worker.Worker-1-slave7.out, 
> spark-mqq-org.apache.spark.deploy.worker.Worker-1-slave8.out
>
>
> Several workers crashed simultaneously due to this error:
> Worker registration failed: Duplicate worker ID
> This is the worker log on one of those crashed workers:
> 16/06/24 16:28:53 INFO ExecutorRunner: Killing process!
> 16/06/24 16:28:53 INFO ExecutorRunner: Runner thread for executor 
> app-20160624003013-0442/26 interrupted
> 16/06/24 16:28:53 INFO ExecutorRunner: Killing process!
> 16/06/24 16:29:03 WARN ExecutorRunner: Failed to terminate process: 
> java.lang.UNIXProcess@31340137. This process will likely be orphaned.
> 16/06/24 16:29:03 WARN ExecutorRunner: Failed to terminate process: 
> java.lang.UNIXProcess@4d3bdb1d. This process will likely be orphaned.
> 16/06/24 16:29:03 INFO Worker: Executor app-20160624003013-0442/8 finished 
> with state KILLED
> 16/06/24 16:29:03 INFO Worker: Executor app-20160624003013-0442/26 finished 
> with state KILLED
> 16/06/24 16:29:03 INFO Worker: Cleaning up local directories for application 
> app-20160624003013-0442
> 16/06/24 16:31:18 INFO ExternalShuffleBlockResolver: Application 
> app-20160624003013-0442 removed, cleanupLocalDirs = true
> 16/06/24 16:31:18 INFO Worker: Asked to launch executor 
> app-20160624162905-0469/14 for SparkStreamingLRScala
> 16/06/24 16:31:18 INFO SecurityManager: Changing view acls to: mqq
> 16/06/24 16:31:18 INFO SecurityManager: Changing modify acls to: mqq
> 16/06/24 16:31:18 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(mqq); users with 
> modify permissions: Set(mqq)
> 16/06/24 16:31:18 INFO ExecutorRunner: Launch command: 
> "/data/jdk1.7.0_60/bin/java" "-cp" 
> "/data/spark-1.6.1-bin-cdh4/conf/:/data/spark-1.6.1-bin-cdh4/lib/spark-assembly-1.6.1-hadoop2.3.0.jar:/data/spark-1.6.1-bin-cdh4/lib/datanucleus-core-3.2.10.jar:/data/spark-1.6.1-bin-cdh4/lib/datanucleus-api-jdo-3.2.6.jar:/data/spark-1.6.1-bin-cdh4/lib/datanucleus-rdbms-3.2.9.jar"
>  "-Xms10240M" "-Xmx10240M" "-Dspark.driver.port=34792" "-XX:MaxPermSize=256m" 
> "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" 
> "spark://CoarseGrainedScheduler@100.65.21.199:34792" "--executor-id" "14" 
> "--hostname" "100.65.21.223" "--cores" "5" "--app-id" 
> "app-20160624162905-0469" "--worker-url" "spark://Worker@100.65.21.223:46581"
> 16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 
> requested this worker to reconnect.
> 16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 
> requested this worker to reconnect.
> 16/06/24 16:31:18 INFO Worker: Connecting to master 100.65.21.199:7077...
> 16/06/24 16:31:18 INFO Worker: Successfully registered with master 
> spark://100.65.21.199:7077
> 16/06/24 16:31:18 INFO Worker: Worker cleanup enabled; old application 
> directories will be deleted in: /data/spark-1.6.1-bin-cdh4/work
> 16/06/24 16:31:18 INFO Worker: Not spawning another attempt to register with 
> the master, since there is an attempt scheduled already.
> 16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 
> requested this worker to reconnect.
> 16/06/24 16:31:18 INFO Worker: Connecting to master 100.65.21.199:7077...
> 16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 
> requested this worker to reconnect.
> 16/06/24 16:31:18 INFO Worker: Not spawning another attempt to register with 
> the master, since there is an attempt scheduled already.
> 16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 
> requested this worker to reconnect.
> 16/06/24 16:31:18 INFO Worker: Not spawning another attempt to register with 
> the master, since there is an attempt scheduled already.
> 16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 
> requested this worker to reconnect.
> 16/06/24 16:31:18 INFO Worker: Not spawning another attempt to register with 
> the master, since there is an attempt scheduled already.
> 16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 
> r

[jira] [Updated] (SPARK-16190) Worker registration failed: Duplicate worker ID

2016-06-24 Thread Thomas Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Huang updated SPARK-16190:
-
Attachment: spark-mqq-org.apache.spark.deploy.worker.Worker-1-slave19.out

worker log of slave 19

> Worker registration failed: Duplicate worker ID
> ---
>
> Key: SPARK-16190
> URL: https://issues.apache.org/jira/browse/SPARK-16190
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.6.1
>Reporter: Thomas Huang
>Priority: Minor
> Attachments: 
> spark-mqq-org.apache.spark.deploy.worker.Worker-1-slave19.out, 
> spark-mqq-org.apache.spark.deploy.worker.Worker-1-slave2.out, 
> spark-mqq-org.apache.spark.deploy.worker.Worker-1-slave7.out, 
> spark-mqq-org.apache.spark.deploy.worker.Worker-1-slave8.out
>
>
> Several workers crashed simultaneously due to this error:
> Worker registration failed: Duplicate worker ID
> This is the worker log on one of those crashed workers:
> 16/06/24 16:28:53 INFO ExecutorRunner: Killing process!
> 16/06/24 16:28:53 INFO ExecutorRunner: Runner thread for executor 
> app-20160624003013-0442/26 interrupted
> 16/06/24 16:28:53 INFO ExecutorRunner: Killing process!
> 16/06/24 16:29:03 WARN ExecutorRunner: Failed to terminate process: 
> java.lang.UNIXProcess@31340137. This process will likely be orphaned.
> 16/06/24 16:29:03 WARN ExecutorRunner: Failed to terminate process: 
> java.lang.UNIXProcess@4d3bdb1d. This process will likely be orphaned.
> 16/06/24 16:29:03 INFO Worker: Executor app-20160624003013-0442/8 finished 
> with state KILLED
> 16/06/24 16:29:03 INFO Worker: Executor app-20160624003013-0442/26 finished 
> with state KILLED
> 16/06/24 16:29:03 INFO Worker: Cleaning up local directories for application 
> app-20160624003013-0442
> 16/06/24 16:31:18 INFO ExternalShuffleBlockResolver: Application 
> app-20160624003013-0442 removed, cleanupLocalDirs = true
> 16/06/24 16:31:18 INFO Worker: Asked to launch executor 
> app-20160624162905-0469/14 for SparkStreamingLRScala
> 16/06/24 16:31:18 INFO SecurityManager: Changing view acls to: mqq
> 16/06/24 16:31:18 INFO SecurityManager: Changing modify acls to: mqq
> 16/06/24 16:31:18 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(mqq); users with 
> modify permissions: Set(mqq)
> 16/06/24 16:31:18 INFO ExecutorRunner: Launch command: 
> "/data/jdk1.7.0_60/bin/java" "-cp" 
> "/data/spark-1.6.1-bin-cdh4/conf/:/data/spark-1.6.1-bin-cdh4/lib/spark-assembly-1.6.1-hadoop2.3.0.jar:/data/spark-1.6.1-bin-cdh4/lib/datanucleus-core-3.2.10.jar:/data/spark-1.6.1-bin-cdh4/lib/datanucleus-api-jdo-3.2.6.jar:/data/spark-1.6.1-bin-cdh4/lib/datanucleus-rdbms-3.2.9.jar"
>  "-Xms10240M" "-Xmx10240M" "-Dspark.driver.port=34792" "-XX:MaxPermSize=256m" 
> "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" 
> "spark://CoarseGrainedScheduler@100.65.21.199:34792" "--executor-id" "14" 
> "--hostname" "100.65.21.223" "--cores" "5" "--app-id" 
> "app-20160624162905-0469" "--worker-url" "spark://Worker@100.65.21.223:46581"
> 16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 
> requested this worker to reconnect.
> 16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 
> requested this worker to reconnect.
> 16/06/24 16:31:18 INFO Worker: Connecting to master 100.65.21.199:7077...
> 16/06/24 16:31:18 INFO Worker: Successfully registered with master 
> spark://100.65.21.199:7077
> 16/06/24 16:31:18 INFO Worker: Worker cleanup enabled; old application 
> directories will be deleted in: /data/spark-1.6.1-bin-cdh4/work
> 16/06/24 16:31:18 INFO Worker: Not spawning another attempt to register with 
> the master, since there is an attempt scheduled already.
> 16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 
> requested this worker to reconnect.
> 16/06/24 16:31:18 INFO Worker: Connecting to master 100.65.21.199:7077...
> 16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 
> requested this worker to reconnect.
> 16/06/24 16:31:18 INFO Worker: Not spawning another attempt to register with 
> the master, since there is an attempt scheduled already.
> 16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 
> requested this worker to reconnect.
> 16/06/24 16:31:18 INFO Worker: Not spawning another attempt to register with 
> the master, since there is an attempt scheduled already.
> 16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 
> requested this worker to reconnect.
> 16/06/24 16:31:18 INFO Worker: Not spawning another attempt to register with 
> the master, since there is an attempt scheduled already.
> 16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 
> re

[jira] [Updated] (SPARK-16190) Worker registration failed: Duplicate worker ID

2016-06-24 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-16190:
--
Priority: Minor  (was: Critical)

How did they stop and how were they restarted?

> Worker registration failed: Duplicate worker ID
> ---
>
> Key: SPARK-16190
> URL: https://issues.apache.org/jira/browse/SPARK-16190
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.6.1
>Reporter: Thomas Huang
>Priority: Minor
> Attachments: 
> spark-mqq-org.apache.spark.deploy.worker.Worker-1-slave19.out, 
> spark-mqq-org.apache.spark.deploy.worker.Worker-1-slave2.out, 
> spark-mqq-org.apache.spark.deploy.worker.Worker-1-slave7.out, 
> spark-mqq-org.apache.spark.deploy.worker.Worker-1-slave8.out
>
>
> Several workers crashed simultaneously due to this error:
> Worker registration failed: Duplicate worker ID
> This is the worker log on one of those crashed workers:
> 16/06/24 16:28:53 INFO ExecutorRunner: Killing process!
> 16/06/24 16:28:53 INFO ExecutorRunner: Runner thread for executor 
> app-20160624003013-0442/26 interrupted
> 16/06/24 16:28:53 INFO ExecutorRunner: Killing process!
> 16/06/24 16:29:03 WARN ExecutorRunner: Failed to terminate process: 
> java.lang.UNIXProcess@31340137. This process will likely be orphaned.
> 16/06/24 16:29:03 WARN ExecutorRunner: Failed to terminate process: 
> java.lang.UNIXProcess@4d3bdb1d. This process will likely be orphaned.
> 16/06/24 16:29:03 INFO Worker: Executor app-20160624003013-0442/8 finished 
> with state KILLED
> 16/06/24 16:29:03 INFO Worker: Executor app-20160624003013-0442/26 finished 
> with state KILLED
> 16/06/24 16:29:03 INFO Worker: Cleaning up local directories for application 
> app-20160624003013-0442
> 16/06/24 16:31:18 INFO ExternalShuffleBlockResolver: Application 
> app-20160624003013-0442 removed, cleanupLocalDirs = true
> 16/06/24 16:31:18 INFO Worker: Asked to launch executor 
> app-20160624162905-0469/14 for SparkStreamingLRScala
> 16/06/24 16:31:18 INFO SecurityManager: Changing view acls to: mqq
> 16/06/24 16:31:18 INFO SecurityManager: Changing modify acls to: mqq
> 16/06/24 16:31:18 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(mqq); users with 
> modify permissions: Set(mqq)
> 16/06/24 16:31:18 INFO ExecutorRunner: Launch command: 
> "/data/jdk1.7.0_60/bin/java" "-cp" 
> "/data/spark-1.6.1-bin-cdh4/conf/:/data/spark-1.6.1-bin-cdh4/lib/spark-assembly-1.6.1-hadoop2.3.0.jar:/data/spark-1.6.1-bin-cdh4/lib/datanucleus-core-3.2.10.jar:/data/spark-1.6.1-bin-cdh4/lib/datanucleus-api-jdo-3.2.6.jar:/data/spark-1.6.1-bin-cdh4/lib/datanucleus-rdbms-3.2.9.jar"
>  "-Xms10240M" "-Xmx10240M" "-Dspark.driver.port=34792" "-XX:MaxPermSize=256m" 
> "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" 
> "spark://CoarseGrainedScheduler@100.65.21.199:34792" "--executor-id" "14" 
> "--hostname" "100.65.21.223" "--cores" "5" "--app-id" 
> "app-20160624162905-0469" "--worker-url" "spark://Worker@100.65.21.223:46581"
> 16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 
> requested this worker to reconnect.
> 16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 
> requested this worker to reconnect.
> 16/06/24 16:31:18 INFO Worker: Connecting to master 100.65.21.199:7077...
> 16/06/24 16:31:18 INFO Worker: Successfully registered with master 
> spark://100.65.21.199:7077
> 16/06/24 16:31:18 INFO Worker: Worker cleanup enabled; old application 
> directories will be deleted in: /data/spark-1.6.1-bin-cdh4/work
> 16/06/24 16:31:18 INFO Worker: Not spawning another attempt to register with 
> the master, since there is an attempt scheduled already.
> 16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 
> requested this worker to reconnect.
> 16/06/24 16:31:18 INFO Worker: Connecting to master 100.65.21.199:7077...
> 16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 
> requested this worker to reconnect.
> 16/06/24 16:31:18 INFO Worker: Not spawning another attempt to register with 
> the master, since there is an attempt scheduled already.
> 16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 
> requested this worker to reconnect.
> 16/06/24 16:31:18 INFO Worker: Not spawning another attempt to register with 
> the master, since there is an attempt scheduled already.
> 16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 
> requested this worker to reconnect.
> 16/06/24 16:31:18 INFO Worker: Not spawning another attempt to register with 
> the master, since there is an attempt scheduled already.
> 16/06/24 16:31:18 INFO Worker: Master with url spark://100.65.21.199:7077 
> requested this worker to 

[jira] [Created] (SPARK-16191) Code-Generated SpecificColumnarIterator fails for wide pivot with caching

2016-06-24 Thread Matthew Livesey (JIRA)
Matthew Livesey created SPARK-16191:
---

 Summary: Code-Generated SpecificColumnarIterator fails for wide 
pivot with caching
 Key: SPARK-16191
 URL: https://issues.apache.org/jira/browse/SPARK-16191
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.1
Reporter: Matthew Livesey


When caching a pivot of more than 2260 columns, the SpecificColumnarIterator 
instance generated by code generation fails to compile with:

bq. failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of 
method \"()Z\" of class 
\"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator\"
 grows beyond 64 KB

This can be reproduced in PySpark with the following (it took some trial and 
error to find that 2261 is the magic number at which the generated class breaks 
the 64KB limit).

{code}
def build_pivot(width):
    categories = ["cat_%s" % i for i in range(0, width)]
    customers = ["cust_%s" % i for i in range(0, 10)]
    rows = []
    for cust in customers:
        for cat in categories:
            for i in range(0, 4):
                row = (cust, cat, i, 7.0)
                rows.append(row)
    rdd = sc.parallelize(rows)
    df = sqlContext.createDataFrame(rdd, ["customer", "category", "instance", "value"])
    pivot_value_rows = df.select("category").distinct().orderBy("category").collect()
    pivot_values = [r.category for r in pivot_value_rows]
    import pyspark.sql.functions as func
    pivot = df.groupBy('customer').pivot("category", pivot_values).agg(func.sum(df.value)).cache()
    pivot.write.save('my_pivot', mode='overwrite')

for i in [2260, 2261]:
    try:
        build_pivot(i)
        print "Succeeded for %s" % i
    except:
        print "Failed for %s" % i
{code}

Removing the `cache()` call avoids the problem and allows wider pivots; since 
the ColumnarIterator is specific to caching, it is not generated when caching 
is not used.

This could be symptomatic of a more general problem: generated code can exceed 
the 64KB bytecode limit, so the same failure may occur in other cases as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15917) Define the number of executors in standalone mode with an easy-to-use property

2016-06-24 Thread Jonathan Taws (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15348108#comment-15348108
 ] 

Jonathan Taws commented on SPARK-15917:
---

I made a change to the *StandaloneSchedulerBackend* so that the initial 
executor limit takes the executor instances property into account, and it 
seems to be the only change needed for the property to be honored.

This is the current behavior with this change:
* If the {{executor.cores}} property isn't set, the {{executor.instances}} 
property is rendered useless, as one executor will simply take all of the 
cores available.
* If the {{executor.cores}} property is set:
** and {{executor.instances}} * {{executor.cores}} *<=* {{cores.max}}, then 
{{executor.instances}} executors will be spawned;
** and {{executor.instances}} * {{executor.cores}} *>* {{cores.max}}, then as 
many executors as possible will be spawned - basically the previous behavior 
when only {{executor.cores}} was set;
** if {{executor.memory}} is also set, all constraints are taken into account 
based on the number of cores and the memory assigned per worker (e.g. if we 
request 3 executors, each with 2 cores and 8gb of memory, on a worker with 8 
cores and 16gb, we only get 2 executors).

It looks pretty consistent to me; what do you think?

I will move on to the exception throwing and a few updates to the 
documentation afterwards.

I have one issue, though:
- If we set the number of executors to 0, no executors are allocated and thus 
we can't run any job. Should we add a warning, or even throw an error, stating 
that no executors were spawned on the worker?
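
For reference, a minimal Scala sketch of the property combination described 
above in standalone mode (the values are arbitrary; today 
{{spark.executor.instances}} is only honored on YARN, and the point of this 
change is to honor it here as well):

{code}
import org.apache.spark.{SparkConf, SparkContext}

// 8 cores allowed in total, 2 cores and 8g per executor, 3 executors requested:
// with the proposed change this should yield exactly 3 executors, subject to
// the cores.max and per-worker memory constraints discussed above.
val conf = new SparkConf()
  .setAppName("standalone-executors-sketch")
  .setMaster("spark://master:7077")
  .set("spark.cores.max", "8")
  .set("spark.executor.cores", "2")
  .set("spark.executor.memory", "8g")
  .set("spark.executor.instances", "3")

val sc = new SparkContext(conf)
{code}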

> Define the number of executors in standalone mode with an easy-to-use property
> --
>
> Key: SPARK-15917
> URL: https://issues.apache.org/jira/browse/SPARK-15917
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Spark Shell, Spark Submit
>Affects Versions: 1.6.1
>Reporter: Jonathan Taws
>Priority: Minor
>
> After stumbling across a few StackOverflow posts around the issue of using a 
> fixed number of executors in standalone mode (non-YARN), I was wondering if 
> we could not add an easier way to set this parameter than having to resort to 
> some calculations based on the number of cores and the memory you have 
> available on your worker. 
> For example, let's say I have 8 cores and 30GB of memory available :
>  - If no option is passed, one executor will be spawned with 8 cores and 1GB 
> of memory allocated.
>  - However, if I want to have only *2* executors, and to use 2 cores and 10GB 
> of memory per executor, I will end up with *3* executors (as the available 
> memory will limit the number of executors) instead of the 2 I was hoping for.
> Sure, I can set {{spark.cores.max}} as a workaround to get exactly what I 
> want, but would it not be easier to add a {{--num-executors}}-like option to 
> standalone mode to be able to really fine-tune the configuration ? This 
> option is already available in YARN mode.
> From my understanding, I don't see any other option lying around that can 
> help achieve this.  
> This seems to be slightly disturbing for newcomers, and standalone mode is 
> probably the first thing anyone will use to just try out Spark or test some 
> configuration.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16192) Improve the type check of CollectSet in CheckAnalysis

2016-06-24 Thread Takeshi Yamamuro (JIRA)
Takeshi Yamamuro created SPARK-16192:


 Summary: Improve the type check of CollectSet in CheckAnalysis
 Key: SPARK-16192
 URL: https://issues.apache.org/jira/browse/SPARK-16192
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.6.1
Reporter: Takeshi Yamamuro


`CollectSet` cannot handle map-typed data because MapTypeData does not 
implement `equals`. So, if a map type appears in `CollectSet`, queries fail.
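
A hedged Scala sketch of the pattern the check should reject (the column and 
table contents are made up; {{functions.map}} is just one way to build a 
map-typed column):

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_set, lit, map}

val spark = SparkSession.builder().appName("collect-set-map-sketch").getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (1, "b"), (2, "c")).toDF("id", "v")

// collect_set over a map-typed column: with the improved check this should be
// rejected in CheckAnalysis with a clear error instead of failing later.
df.groupBy("id")
  .agg(collect_set(map(lit("k"), $"v")))
  .show()
{code}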



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6685) Use DSYRK to compute AtA in ALS

2016-06-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6685:
---

Assignee: (was: Apache Spark)

> Use DSYRK to compute AtA in ALS
> ---
>
> Key: SPARK-6685
> URL: https://issues.apache.org/jira/browse/SPARK-6685
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Xiangrui Meng
>Priority: Minor
>
> Now we use DSPR, a Level 2 BLAS routine, to compute AtA in ALS. We should 
> switch to DSYRK so that native BLAS can accelerate the computation. The 
> factors should remain dense vectors, and we can pre-allocate a buffer to 
> stack vectors and do a Level 3 update. This requires some benchmarking to 
> demonstrate the improvement.
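
In linear-algebra terms (a sketch in LaTeX; the block size k is a tuning 
choice, not something specified in the issue):

{code}
% DSPR: one symmetric rank-1 update per factor vector a_i (Level 2 BLAS)
A^\top A \;=\; \sum_i a_i a_i^\top

% DSYRK: stack k factor vectors into a block B_j and do one symmetric
% rank-k update per block (Level 3 BLAS), which native BLAS can exploit
A^\top A \;=\; \sum_j B_j^\top B_j, \qquad B_j \in \mathbb{R}^{k \times n}
{code}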



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6685) Use DSYRK to compute AtA in ALS

2016-06-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6685:
---

Assignee: Apache Spark

> Use DSYRK to compute AtA in ALS
> ---
>
> Key: SPARK-6685
> URL: https://issues.apache.org/jira/browse/SPARK-6685
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Xiangrui Meng
>Assignee: Apache Spark
>Priority: Minor
>
> Now we use DSPR, a Level 2 BLAS routine, to compute AtA in ALS. We should 
> switch to DSYRK so that native BLAS can accelerate the computation. The 
> factors should remain dense vectors, and we can pre-allocate a buffer to 
> stack vectors and do a Level 3 update. This requires some benchmarking to 
> demonstrate the improvement.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6685) Use DSYRK to compute AtA in ALS

2016-06-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15348116#comment-15348116
 ] 

Apache Spark commented on SPARK-6685:
-

User 'hqzizania' has created a pull request for this issue:
https://github.com/apache/spark/pull/13891

> Use DSYRK to compute AtA in ALS
> ---
>
> Key: SPARK-6685
> URL: https://issues.apache.org/jira/browse/SPARK-6685
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Xiangrui Meng
>Priority: Minor
>
> Now we use DSPR, a Level 2 BLAS routine, to compute AtA in ALS. We should 
> switch to DSYRK so that native BLAS can accelerate the computation. The 
> factors should remain dense vectors, and we can pre-allocate a buffer to 
> stack vectors and do a Level 3 update. This requires some benchmarking to 
> demonstrate the improvement.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16192) Improve the type check of CollectSet in CheckAnalysis

2016-06-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16192:


Assignee: Apache Spark

> Improve the type check of CollectSet in CheckAnalysis
> -
>
> Key: SPARK-16192
> URL: https://issues.apache.org/jira/browse/SPARK-16192
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Takeshi Yamamuro
>Assignee: Apache Spark
>
> `CollectSet` cannot handle map-typed data because MapTypeData does not 
> implement `equals`. So, if a map type appears in `CollectSet`, queries fail.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16192) Improve the type check of CollectSet in CheckAnalysis

2016-06-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16192:


Assignee: (was: Apache Spark)

> Improve the type check of CollectSet in CheckAnalysis
> -
>
> Key: SPARK-16192
> URL: https://issues.apache.org/jira/browse/SPARK-16192
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Takeshi Yamamuro
>
> `CollectSet` cannot handle map-typed data because MapTypeData does not 
> implement `equals`. So, if a map type appears in `CollectSet`, queries fail.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16192) Improve the type check of CollectSet in CheckAnalysis

2016-06-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15348119#comment-15348119
 ] 

Apache Spark commented on SPARK-16192:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/13892

> Improve the type check of CollectSet in CheckAnalysis
> -
>
> Key: SPARK-16192
> URL: https://issues.apache.org/jira/browse/SPARK-16192
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Takeshi Yamamuro
>
> `CollectSet` cannot handle map-typed data because MapTypeData does not 
> implement `equals`. So, if a map type appears in `CollectSet`, queries fail.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16188) Spark sql create a lot of small files

2016-06-24 Thread cen yuhai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

cen yuhai updated SPARK-16188:
--
Priority: Major  (was: Minor)

> Spark sql create a lot of small files
> -
>
> Key: SPARK-16188
> URL: https://issues.apache.org/jira/browse/SPARK-16188
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: spark 1.6.1
>Reporter: cen yuhai
>
> I find that Spark SQL creates as many output files as there are partitions. 
> When the results are small, this produces too many small files, most of them 
> empty.
> Hive has a mechanism that checks the average file size: if the average is 
> smaller than "hive.merge.smallfiles.avgsize", Hive adds a job to merge the 
> files.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15997) Audit ml.feature Update documentation for ml feature transformers

2016-06-24 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-15997.

   Resolution: Fixed
Fix Version/s: 2.0.1

Issue resolved by pull request 13745
[https://github.com/apache/spark/pull/13745]

> Audit ml.feature Update documentation for ml feature transformers
> -
>
> Key: SPARK-15997
> URL: https://issues.apache.org/jira/browse/SPARK-15997
> Project: Spark
>  Issue Type: Documentation
>  Components: ML, MLlib
>Affects Versions: 2.0.0
>Reporter: Gayathri Murali
>Assignee: Gayathri Murali
> Fix For: 2.0.1
>
>
> This JIRA is a subtask of SPARK-15100 and improves documentation for new 
> features added to:
> 1. HashingTF
> 2. CountVectorizer
> 3. QuantileDiscretizer



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16149) API consistency discussion: CountVectorizer.{minDF -> minDocFreq, minTF -> minTermFreq}

2016-06-24 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15348173#comment-15348173
 ] 

Nick Pentreath commented on SPARK-16149:


I'd generally vote for:
* if it's a new param / model, be consistent with external impl if it's widely 
used (e.g. scikit-learn, R), unless the naming of that param in that library is 
itself inconsistent or perhaps difficult to understand;
* if it already exists elsewhere within spark.ml, then be consistent with 
existing Spark naming.

For this case, we could deprecate one of the versions (easier to deprecate the 
one in IDF rather than 2 in CV I'd say, so go for {{minDF}} across the board).
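
To make the two current spellings concrete, a small Scala sketch (setter names 
as in spark.ml 2.0; the values are arbitrary):

{code}
import org.apache.spark.ml.feature.{CountVectorizer, IDF}

// The same concept, spelled two ways today:
val cv  = new CountVectorizer().setMinDF(2.0)   // "minDF" in CountVectorizer
val idf = new IDF().setMinDocFreq(2)            // "minDocFreq" in IDF
{code}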

> API consistency discussion: CountVectorizer.{minDF -> minDocFreq, minTF -> 
> minTermFreq}
> ---
>
> Key: SPARK-16149
> URL: https://issues.apache.org/jira/browse/SPARK-16149
> Project: Spark
>  Issue Type: Brainstorming
>  Components: MLlib
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>
> We used `minDF` and `minTF` in CountVectorizer and `minDocFreq` in IDF. It 
> would be nice to keep the naming consistent. This was discussed in 
> https://github.com/apache/spark/pull/7388 and the decision was made based on 
> sklearn compatibility. However, we didn't look broadly across MLlib APIs. 
> Maybe we can live with this small inconsistency but it would be nice to 
> discuss the guideline (consistent with other libraries or existing ones in 
> MLlib).
> cc: [~josephkb] [~yuhaoyan]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14172) Hive table partition predicate not passed down correctly

2016-06-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15348184#comment-15348184
 ] 

Apache Spark commented on SPARK-14172:
--

User 'jiangxb1987' has created a pull request for this issue:
https://github.com/apache/spark/pull/13893

> Hive table partition predicate not passed down correctly
> 
>
> Key: SPARK-14172
> URL: https://issues.apache.org/jira/browse/SPARK-14172
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Yingji Zhang
>Priority: Critical
>
> When the Hive SQL contains nondeterministic expressions, the Spark plan will 
> not push the partition predicate down to the HiveTableScan. For example:
> {code}
> -- consider following query which uses a random function to sample rows
> SELECT *
> FROM table_a
> WHERE partition_col = 'some_value'
> AND rand() < 0.01;
> {code}
> The Spark plan does not push the partition predicate down to HiveTableScan, 
> which ends up scanning data from all partitions of the table.
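
A hedged Scala reproduction sketch (the table and predicate are taken from the 
example above; {{explain(true)}} just makes the missing pushdown visible in the 
physical plan):

{code}
// Run against a Hive table partitioned by partition_col; names as in the example.
val df = sqlContext.sql(
  """SELECT *
    |FROM table_a
    |WHERE partition_col = 'some_value'
    |AND rand() < 0.01""".stripMargin)

// With the bug, the HiveTableScan node in the physical plan shows no partition
// pruning, so every partition of table_a is scanned.
df.explain(true)
{code}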



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14172) Hive table partition predicate not passed down correctly

2016-06-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14172:


Assignee: (was: Apache Spark)

> Hive table partition predicate not passed down correctly
> 
>
> Key: SPARK-14172
> URL: https://issues.apache.org/jira/browse/SPARK-14172
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Yingji Zhang
>Priority: Critical
>
> When the Hive SQL contains nondeterministic expressions, the Spark plan will 
> not push the partition predicate down to the HiveTableScan. For example:
> {code}
> -- consider following query which uses a random function to sample rows
> SELECT *
> FROM table_a
> WHERE partition_col = 'some_value'
> AND rand() < 0.01;
> {code}
> The Spark plan does not push the partition predicate down to HiveTableScan, 
> which ends up scanning data from all partitions of the table.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14172) Hive table partition predicate not passed down correctly

2016-06-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14172:


Assignee: Apache Spark

> Hive table partition predicate not passed down correctly
> 
>
> Key: SPARK-14172
> URL: https://issues.apache.org/jira/browse/SPARK-14172
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Yingji Zhang
>Assignee: Apache Spark
>Priority: Critical
>
> When the Hive SQL contains nondeterministic expressions, the Spark plan will 
> not push the partition predicate down to the HiveTableScan. For example:
> {code}
> -- consider following query which uses a random function to sample rows
> SELECT *
> FROM table_a
> WHERE partition_col = 'some_value'
> AND rand() < 0.01;
> {code}
> The Spark plan does not push the partition predicate down to HiveTableScan, 
> which ends up scanning data from all partitions of the table.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15955) Failed Spark application returns with exitcode equals to zero

2016-06-24 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15348220#comment-15348220
 ] 

Thomas Graves commented on SPARK-15955:
---

There are some corner cases in Spark 1.x where we report success when we 
shouldn't. The reason is that we don't really know what happened, and for 
backwards compatibility it defaults to success. That behavior was changed in 
2.x.

But either way there isn't enough information here: why did the application 
fail? Dig into the logs or the UI. Pi should not normally fail. If you can 
provide full information on how to reproduce this, and more context, that 
would be great.

> Failed Spark application returns with exitcode equals to zero
> -
>
> Key: SPARK-15955
> URL: https://issues.apache.org/jira/browse/SPARK-15955
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1
>Reporter: Yesha Vora
>
> Scenario:
> * Set up cluster with wire-encryption enabled.
> * Set 'spark.authenticate.enableSaslEncryption' = 'false' and 
> 'spark.shuffle.service.enabled' = 'true'.
> * Run the SparkPi application.
> {code}
> client token: Token { kind: YARN_CLIENT_TOKEN, service:  }
> diagnostics: Max number of executor failures (3) reached
> ApplicationMaster host: xx.xx.xx.xxx
> ApplicationMaster RPC port: 0
> queue: default
> start time: 1465941051976
> final status: FAILED
> tracking URL: https://xx.xx.xx.xxx:8090/proxy/application_1465925772890_0016/
> user: hrt_qa
> Exception in thread "main" org.apache.spark.SparkException: Application 
> application_1465925772890_0016 finished with failed status
> at org.apache.spark.deploy.yarn.Client.run(Client.scala:1092)
> at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1139)
> at org.apache.spark.deploy.yarn.Client.main(Client.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> INFO ShutdownHookManager: Shutdown hook called{code}
> This Spark application exits with exit code 0. A failed application should 
> not return exit code 0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15254) Improve ML pipeline Cross Validation Scaladoc & PyDoc

2016-06-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15254:


Assignee: Apache Spark

> Improve ML pipeline Cross Validation Scaladoc & PyDoc
> -
>
> Key: SPARK-15254
> URL: https://issues.apache.org/jira/browse/SPARK-15254
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Reporter: holdenk
>Assignee: Apache Spark
>Priority: Minor
>
> The ML pipeline Cross Validation Scaladoc & PyDoc are very sparse; we should 
> fill them out with a more concrete description.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15254) Improve ML pipeline Cross Validation Scaladoc & PyDoc

2016-06-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15348325#comment-15348325
 ] 

Apache Spark commented on SPARK-15254:
--

User 'krishnakalyan3' has created a pull request for this issue:
https://github.com/apache/spark/pull/13894

> Improve ML pipeline Cross Validation Scaladoc & PyDoc
> -
>
> Key: SPARK-15254
> URL: https://issues.apache.org/jira/browse/SPARK-15254
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Reporter: holdenk
>Priority: Minor
>
> The ML pipeline Cross Validation Scaladoc & PyDoc are very sparse; we should 
> fill them out with a more concrete description.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15254) Improve ML pipeline Cross Validation Scaladoc & PyDoc

2016-06-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15254:


Assignee: (was: Apache Spark)

> Improve ML pipeline Cross Validation Scaladoc & PyDoc
> -
>
> Key: SPARK-15254
> URL: https://issues.apache.org/jira/browse/SPARK-15254
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Reporter: holdenk
>Priority: Minor
>
> The ML pipeline Cross Validation Scaladoc & PyDoc are very sparse; we should 
> fill them out with a more concrete description.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15963) `TaskKilledException` is not correctly caught in `Executor.TaskRunner`

2016-06-24 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-15963:
-
Assignee: Liwei Lin

> `TaskKilledException` is not correctly caught in `Executor.TaskRunner`
> --
>
> Key: SPARK-15963
> URL: https://issues.apache.org/jira/browse/SPARK-15963
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Liwei Lin
>Assignee: Liwei Lin
>
> Currently in {{Executor.TaskRunner}}, we:
> {code}
> try {...}
> catch {
>   case _: TaskKilledException | _: InterruptedException if task.killed =>
>   ...
> }
> {code}
> What we intended was:
> - {{TaskKilledException}} OR ({{InterruptedException}} AND {{task.killed}})
> But in fact it is:
> - ({{TaskKilledException}} OR {{InterruptedException}}) AND {{task.killed}}
> As a consequence, sometimes we cannot catch {{TaskKilledException}} and will 
> incorrectly report our task status as {{FAILED}} (which should really be 
> {{KILLED}}).
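
As an illustration of the intended semantics (a standalone sketch, not Spark's actual 
{{TaskRunner}} code; {{killed}} stands in for {{task.killed}} and the names are made up), 
the two conditions can be written as separate cases so the guard applies only to 
{{InterruptedException}}:

{code}
// Paste into a Scala REPL. The sketch reports the status a task would get.
class TaskKilledException extends RuntimeException

def report(body: => Unit, killed: Boolean): String =
  try { body; "FINISHED" }
  catch {
    case _: TaskKilledException            => "KILLED"   // always a kill
    case _: InterruptedException if killed => "KILLED"   // kill delivered via interrupt
    case _: Throwable                      => "FAILED"
  }

// With the single combined case, a TaskKilledException thrown before the kill
// flag was set would have been reported as FAILED; here it is KILLED:
println(report(throw new TaskKilledException, killed = false))
{code}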



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15963) `TaskKilledException` is not correctly caught in `Executor.TaskRunner`

2016-06-24 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid resolved SPARK-15963.
--
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 13685
[https://github.com/apache/spark/pull/13685]

> `TaskKilledException` is not correctly caught in `Executor.TaskRunner`
> --
>
> Key: SPARK-15963
> URL: https://issues.apache.org/jira/browse/SPARK-15963
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Liwei Lin
>Assignee: Liwei Lin
> Fix For: 2.1.0
>
>
> Currently in {{Executor.TaskRunner}}, we:
> {code}
> try {...}
> catch {
>   case _: TaskKilledException | _: InterruptedException if task.killed =>
>   ...
> }
> {code}
> What we intended was:
> - {{TaskKilledException}} OR ({{InterruptedException}} AND {{task.killed}})
> But in fact it is:
> - ({{TaskKilledException}} OR {{InterruptedException}}) AND {{task.killed}}
> As a consequence, sometimes we cannot catch {{TaskKilledException}} and will 
> incorrectly report our task status as {{FAILED}} (which should really be 
> {{KILLED}}).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15963) `TaskKilledException` is not correctly caught in `Executor.TaskRunner`

2016-06-24 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-15963:
-
Description: 
Before this change, if either of the following cases happened to a task, the 
task would be marked as FAILED instead of KILLED:

- the task was killed before it was deserialized
- executor.kill() marked taskRunner.killed, but the worker thread threw the 
TaskKilledException before task.killed() was called

Currently in {{Executor.TaskRunner}}, we:

{code}
try {...}
catch {
  case _: TaskKilledException | _: InterruptedException if task.killed =>
  ...
}
{code}
What we intended was:
- {{TaskKilledException}} OR ({{InterruptedException}} AND {{task.killed}})

But in fact it is:
- ({{TaskKilledException}} OR {{InterruptedException}}) AND {{task.killed}}

As a consequence, sometimes we cannot catch {{TaskKilledException}} and will 
incorrectly report our task status as {{FAILED}} (which should really be 
{{KILLED}}).

  was:
Currently in {{Executor.TaskRunner}}, we:

{code}
try {...}
catch {
  case _: TaskKilledException | _: InterruptedException if task.killed =>
  ...
}
{code}
What we intended was:
- {{TaskKilledException}} OR ({{InterruptedException}} AND {{task.killed}})

But in fact it is:
- ({{TaskKilledException}} OR {{InterruptedException}}) AND {{task.killed}}

As a consequence, sometimes we cannot catch {{TaskKilledException}} and will 
incorrectly report our task status as {{FAILED}} (which should really be 
{{KILLED}}).


> `TaskKilledException` is not correctly caught in `Executor.TaskRunner`
> --
>
> Key: SPARK-15963
> URL: https://issues.apache.org/jira/browse/SPARK-15963
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Liwei Lin
>Assignee: Liwei Lin
> Fix For: 2.1.0
>
>
> Before this change, if either of the following cases happened to a task, the 
> task would be marked as FAILED instead of KILLED:
> - the task was killed before it was deserialized
> - executor.kill() marked taskRunner.killed, but the worker thread threw the 
> TaskKilledException before task.killed() was called
> Currently in {{Executor.TaskRunner}}, we:
> {code}
> try {...}
> catch {
>   case _: TaskKilledException | _: InterruptedException if task.killed =>
>   ...
> }
> {code}
> What we intended was:
> - {{TaskKilledException}} OR ({{InterruptedException}} AND {{task.killed}})
> But in fact it is:
> - ({{TaskKilledException}} OR {{InterruptedException}}) AND {{task.killed}}
> As a consequence, sometimes we cannot catch {{TaskKilledException}} and will 
> incorrectly report our task status as {{FAILED}} (which should really be 
> {{KILLED}}).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16112) R programming guide update for gapply

2016-06-24 Thread Narine Kokhlikyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15348430#comment-15348430
 ] 

Narine Kokhlikyan commented on SPARK-16112:
---

[~felixcheung], [~shivaram], [~sunrui], should I add the programming guide for 
gapplyCollect too?
It hasn't been merged yet; that's why I'm holding off on this.

> R programming guide update for gapply
> -
>
> Key: SPARK-16112
> URL: https://issues.apache.org/jira/browse/SPARK-16112
> Project: Spark
>  Issue Type: Documentation
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Kai Jiang
>Priority: Blocker
>
> Update programming guide for spark.gapply.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16164) CombineFilters should keep the ordering in the logical plan

2016-06-24 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15348438#comment-15348438
 ] 

Xiangrui Meng commented on SPARK-16164:
---

[~lian cheng] See my last comment on GitHub:

I didn't use UDF explicitly in a filter expression. It is like the following:

{code}
filter b > 0
set a = udf(b)
filter a > 2
{code}

and the optimizer merged them into

{code}
filter udf(b) > 2 and b > 0
{code}

This could happen for any UDFs that throw exceptions. Are we assuming that UDFs 
never throw exceptions?
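
A minimal sketch of that failure mode (not the original pipeline; the lookup map and 
column names are made up, and it assumes a spark-shell session):

{code}
import org.apache.spark.sql.functions.udf
import spark.implicits._   // `spark` is the shell's SparkSession

// UDF that throws for values the earlier filter was supposed to remove
val lookup = Map("0" -> 0.0, "1" -> 1.0, "2" -> 2.0)
val toIdx = udf((s: String) => lookup(s))   // NoSuchElementException for "3", "4"

val df = Seq("0", "1", "2", "3", "4").toDF("value")
val result = df
  .filter($"value" < "3")                   // intended to run first
  .withColumn("idx", toIdx($"value"))
  .filter($"idx" > 1)

// If filter combination/pushdown rewrites this as
//   filter(toIdx(value) > 1 AND value < "3")
// the UDF can be evaluated on "3" and "4" and throw, even though the first
// filter was meant to exclude them.
result.show()
{code}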

> CombineFilters should keep the ordering in the logical plan
> ---
>
> Key: SPARK-16164
> URL: https://issues.apache.org/jira/browse/SPARK-16164
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Dongjoon Hyun
> Fix For: 2.0.1, 2.1.0
>
>
> [~cmccubbin] reported a bug when he used StringIndexer in an ML pipeline with 
> additional filters. It seems that during filter pushdown, we changed the 
> ordering in the logical plan. I'm not sure whether we should treat this as a 
> bug.
> {code}
> val df1 = (0 until 3).map(_.toString).toDF
> val indexer = new StringIndexer()
>   .setInputCol("value")
>   .setOutputCol("idx")
>   .setHandleInvalid("skip")
>   .fit(df1)
> val df2 = (0 until 5).map(_.toString).toDF
> val predictions = indexer.transform(df2)
> predictions.show() // this is okay
> predictions.where('idx > 2).show() // this will throw an exception
> {code}
> Please see the notebook at 
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1233855/2159162931615821/588180/latest.html
>  for error messages.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10073) Python withColumn for existing column name not consistent with scala

2016-06-24 Thread Russell Bradberry (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15348468#comment-15348468
 ] 

Russell Bradberry commented on SPARK-10073:
---

With this, you added:
{code}assert isinstance(col, Column), "col should be Column"{code}

This seems to assert against UDF types, which, according to the documentation, 
are valid as columns.

How would I use 
{code}.withColumn('colname', udf(f, ftype)){code}?

> Python withColumn for existing column name not consistent with scala
> 
>
> Key: SPARK-10073
> URL: https://issues.apache.org/jira/browse/SPARK-10073
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Michael Armbrust
>Assignee: Davies Liu
>Priority: Blocker
> Fix For: 1.5.0
>
>
> The same code as below works in Scala (replacing the old column with the new 
> one).
> {code}
> from pyspark.sql import Row
> df = sc.parallelize([Row(a=1)]).toDF()
> df.withColumn("a", df.a).select("a")
> ---
> AnalysisException Traceback (most recent call last)
>  in ()
>   1 from pyspark.sql import Row
>   2 df = sc.parallelize([Row(a=1)]).toDF()
> > 3 df.withColumn("a", df.a).select("a")
> /home/ubuntu/databricks/spark/python/pyspark/sql/dataframe.py in select(self, 
> *cols)
> 764 [Row(name=u'Alice', age=12), Row(name=u'Bob', age=15)]
> 765 """
> --> 766 jdf = self._jdf.select(self._jcols(*cols))
> 767 return DataFrame(jdf, self.sql_ctx)
> 768 
> /home/ubuntu/databricks/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py
>  in __call__(self, *args)
> 536 answer = self.gateway_client.send_command(command)
> 537 return_value = get_return_value(answer, self.gateway_client,
> --> 538 self.target_id, self.name)
> 539 
> 540 for temp_arg in temp_args:
> /home/ubuntu/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
>  38 s = e.java_exception.toString()
>  39 if s.startswith('org.apache.spark.sql.AnalysisException: 
> '):
> ---> 40 raise AnalysisException(s.split(': ', 1)[1])
>  41 if s.startswith('java.lang.IllegalArgumentException: '):
>  42 raise IllegalArgumentException(s.split(': ', 1)[1])
> AnalysisException: Reference 'a' is ambiguous, could be: a#894L, a#895L.;
> {code}
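
For comparison, a minimal sketch of the Scala behaviour the description refers to 
(assumes a spark-shell session; not taken from the original report):

{code}
import spark.implicits._   // in spark-shell; on 1.x use sqlContext.implicits._

val df = Seq(1).toDF("a")
// In Scala, withColumn with an existing name replaces that column, so the later
// select("a") resolves to a single, unambiguous attribute instead of failing:
df.withColumn("a", df("a") + 1).select("a").show()   // one column `a` with value 2
{code}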



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16112) R programming guide update for gapply

2016-06-24 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15348480#comment-15348480
 ] 

Shivaram Venkataraman commented on SPARK-16112:
---

Feel free to include both -- I'll make sure to merge the gapplyCollect PR 
before the programming guide gets merged

> R programming guide update for gapply
> -
>
> Key: SPARK-16112
> URL: https://issues.apache.org/jira/browse/SPARK-16112
> Project: Spark
>  Issue Type: Documentation
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Kai Jiang
>Priority: Blocker
>
> Update programming guide for spark.gapply.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16112) R programming guide update for gapply and gapplyCollect

2016-06-24 Thread Kai Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Jiang updated SPARK-16112:
--
Summary: R programming guide update for gapply and gapplyCollect  (was: R 
programming guide update for gapply)

> R programming guide update for gapply and gapplyCollect
> ---
>
> Key: SPARK-16112
> URL: https://issues.apache.org/jira/browse/SPARK-16112
> Project: Spark
>  Issue Type: Documentation
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Kai Jiang
>Priority: Blocker
>
> Update programming guide for spark.gapply.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16193) Address flaky ExternalAppendOnlyMapSuite spilling tests

2016-06-24 Thread Sean Owen (JIRA)
Sean Owen created SPARK-16193:
-

 Summary: Address flaky ExternalAppendOnlyMapSuite spilling tests
 Key: SPARK-16193
 URL: https://issues.apache.org/jira/browse/SPARK-16193
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, Tests
Affects Versions: 2.0.0
Reporter: Sean Owen
Assignee: Sean Owen
Priority: Minor


We've seen tests fail, for different codecs and operations, like so, most 
recently from 2.0.0 RC1:

{code}
- spilling with compression *** FAILED ***
  java.lang.Exception: Test failed with compression using codec 
org.apache.spark.io.LZ4CompressionCodec:

assertion failed: expected groupByKey to spill, but did not
  at scala.Predef$.assert(Predef.scala:170)
  at org.apache.spark.TestUtils$.assertSpilled(TestUtils.scala:170)
  at 
org.apache.spark.util.collection.ExternalAppendOnlyMapSuite.org$apache$spark$util$collection$ExternalAppendOnlyMapSuite$$testSimpleSpilling(ExternalAppendOnlyMapSuite.scala:253)
  at 
org.apache.spark.util.collection.ExternalAppendOnlyMapSuite$$anonfun$10$$anonfun$apply$mcV$sp$8.apply(ExternalAppendOnlyMapSuite.scala:218)
  at 
org.apache.spark.util.collection.ExternalAppendOnlyMapSuite$$anonfun$10$$anonfun$apply$mcV$sp$8.apply(ExternalAppendOnlyMapSuite.scala:216)
  at scala.collection.immutable.Stream.foreach(Stream.scala:594)
  ...
{code}

My theory is that the listener doesn't receive notification of the spilled 
stages early enough to show up in the test. It should wait until the job is 
done before letting the test proceed to query the number of spilled stages.
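
One way the test could wait, sketched with a plain {{SparkListener}} and a latch (an 
illustration of the idea, not the actual patch; {{sc}} is assumed to be the suite's 
SparkContext and the workload is a stand-in):

{code}
import java.util.concurrent.{CountDownLatch, TimeUnit}
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd}

val jobDone = new CountDownLatch(1)
sc.addSparkListener(new SparkListener {
  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit = jobDone.countDown()
})

// run the spilling workload
sc.parallelize(1 to 100000, 4).map(i => (i % 10, i)).groupByKey().count()

// block until the listener has seen the job end before asserting on spills
assert(jobDone.await(60, TimeUnit.SECONDS), "job did not complete in time")
{code}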



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16193) Address flaky ExternalAppendOnlyMapSuite spilling tests

2016-06-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16193:


Assignee: Sean Owen  (was: Apache Spark)

> Address flaky ExternalAppendOnlyMapSuite spilling tests
> ---
>
> Key: SPARK-16193
> URL: https://issues.apache.org/jira/browse/SPARK-16193
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 2.0.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
>
> We've seen tests fail, for different codecs and operations, like so, most 
> recently from 2.0.0 RC1:
> {code}
> - spilling with compression *** FAILED ***
>   java.lang.Exception: Test failed with compression using codec 
> org.apache.spark.io.LZ4CompressionCodec:
> assertion failed: expected groupByKey to spill, but did not
>   at scala.Predef$.assert(Predef.scala:170)
>   at org.apache.spark.TestUtils$.assertSpilled(TestUtils.scala:170)
>   at 
> org.apache.spark.util.collection.ExternalAppendOnlyMapSuite.org$apache$spark$util$collection$ExternalAppendOnlyMapSuite$$testSimpleSpilling(ExternalAppendOnlyMapSuite.scala:253)
>   at 
> org.apache.spark.util.collection.ExternalAppendOnlyMapSuite$$anonfun$10$$anonfun$apply$mcV$sp$8.apply(ExternalAppendOnlyMapSuite.scala:218)
>   at 
> org.apache.spark.util.collection.ExternalAppendOnlyMapSuite$$anonfun$10$$anonfun$apply$mcV$sp$8.apply(ExternalAppendOnlyMapSuite.scala:216)
>   at scala.collection.immutable.Stream.foreach(Stream.scala:594)
>   ...
> {code}
> My theory is that the listener doesn't receive notification of the spilled 
> stages early enough to show up in the test. It should wait until the job is 
> done before letting the test proceed to query the number of spilled stages.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16193) Address flaky ExternalAppendOnlyMapSuite spilling tests

2016-06-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16193:


Assignee: Apache Spark  (was: Sean Owen)

> Address flaky ExternalAppendOnlyMapSuite spilling tests
> ---
>
> Key: SPARK-16193
> URL: https://issues.apache.org/jira/browse/SPARK-16193
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 2.0.0
>Reporter: Sean Owen
>Assignee: Apache Spark
>Priority: Minor
>
> We've seen tests fail, for different codecs and operations, like so, most 
> recently from 2.0.0 RC1:
> {code}
> - spilling with compression *** FAILED ***
>   java.lang.Exception: Test failed with compression using codec 
> org.apache.spark.io.LZ4CompressionCodec:
> assertion failed: expected groupByKey to spill, but did not
>   at scala.Predef$.assert(Predef.scala:170)
>   at org.apache.spark.TestUtils$.assertSpilled(TestUtils.scala:170)
>   at 
> org.apache.spark.util.collection.ExternalAppendOnlyMapSuite.org$apache$spark$util$collection$ExternalAppendOnlyMapSuite$$testSimpleSpilling(ExternalAppendOnlyMapSuite.scala:253)
>   at 
> org.apache.spark.util.collection.ExternalAppendOnlyMapSuite$$anonfun$10$$anonfun$apply$mcV$sp$8.apply(ExternalAppendOnlyMapSuite.scala:218)
>   at 
> org.apache.spark.util.collection.ExternalAppendOnlyMapSuite$$anonfun$10$$anonfun$apply$mcV$sp$8.apply(ExternalAppendOnlyMapSuite.scala:216)
>   at scala.collection.immutable.Stream.foreach(Stream.scala:594)
>   ...
> {code}
> My theory is that the listener doesn't receive notification of the spilled 
> stages early enough to show up in the test. It should wait until the job is 
> done before letting the test proceed to query the number of spilled stages.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16194) No way to dynamically set env vars on driver in cluster mode

2016-06-24 Thread Michael Gummelt (JIRA)
Michael Gummelt created SPARK-16194:
---

 Summary: No way to dynamically set env vars on driver in cluster 
mode
 Key: SPARK-16194
 URL: https://issues.apache.org/jira/browse/SPARK-16194
 Project: Spark
  Issue Type: Bug
Affects Versions: 2.0.0
Reporter: Michael Gummelt


I often need to dynamically configure a driver when submitting in cluster mode, 
but there's currently no way of setting env vars.  {{spark-env.sh}} lets me set 
env vars, but I have to statically build that into my spark distribution.  I 
need a solution for specifying them in {{spark-submit}}.  Much like 
{{spark.executorEnv.[ENV]}}, but for drivers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16194) No way to dynamically set env vars on driver in cluster mode

2016-06-24 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-16194:
--
  Priority: Minor  (was: Major)
Issue Type: Improvement  (was: Bug)

Env variables are pretty much from outside Spark right?
Generally, these are being removed and deprecated anyway.
Any chance of just using a sys property or command line alternative?

> No way to dynamically set env vars on driver in cluster mode
> 
>
> Key: SPARK-16194
> URL: https://issues.apache.org/jira/browse/SPARK-16194
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Michael Gummelt
>Priority: Minor
>
> I often need to dynamically configure a driver when submitting in cluster 
> mode, but there's currently no way of setting env vars.  {{spark-env.sh}} 
> lets me set env vars, but I have to statically build that into my spark 
> distribution.  I need a solution for specifying them in {{spark-submit}}.  
> Much like {{spark.executorEnv.[ENV]}}, but for drivers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16193) Address flaky ExternalAppendOnlyMapSuite spilling tests

2016-06-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15348598#comment-15348598
 ] 

Apache Spark commented on SPARK-16193:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/13896

> Address flaky ExternalAppendOnlyMapSuite spilling tests
> ---
>
> Key: SPARK-16193
> URL: https://issues.apache.org/jira/browse/SPARK-16193
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 2.0.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
>
> We've seen tests fail, for different codecs and operations, like so, most 
> recently from 2.0.0 RC1:
> {code}
> - spilling with compression *** FAILED ***
>   java.lang.Exception: Test failed with compression using codec 
> org.apache.spark.io.LZ4CompressionCodec:
> assertion failed: expected groupByKey to spill, but did not
>   at scala.Predef$.assert(Predef.scala:170)
>   at org.apache.spark.TestUtils$.assertSpilled(TestUtils.scala:170)
>   at 
> org.apache.spark.util.collection.ExternalAppendOnlyMapSuite.org$apache$spark$util$collection$ExternalAppendOnlyMapSuite$$testSimpleSpilling(ExternalAppendOnlyMapSuite.scala:253)
>   at 
> org.apache.spark.util.collection.ExternalAppendOnlyMapSuite$$anonfun$10$$anonfun$apply$mcV$sp$8.apply(ExternalAppendOnlyMapSuite.scala:218)
>   at 
> org.apache.spark.util.collection.ExternalAppendOnlyMapSuite$$anonfun$10$$anonfun$apply$mcV$sp$8.apply(ExternalAppendOnlyMapSuite.scala:216)
>   at scala.collection.immutable.Stream.foreach(Stream.scala:594)
>   ...
> {code}
> My theory is that the listener doesn't receive notification of the spilled 
> stages early enough to show up in the test. It should wait until the job is 
> done before letting the test proceed to query the number of spilled stages.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16195) Allow users to specify empty over clause in window expressions through dataset API

2016-06-24 Thread Dilip Biswal (JIRA)
Dilip Biswal created SPARK-16195:


 Summary: Allow users to specify empty over clause in window 
expressions through dataset API
 Key: SPARK-16195
 URL: https://issues.apache.org/jira/browse/SPARK-16195
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Dilip Biswal
Priority: Minor


In SQL, it's allowed to specify an empty OVER clause in the window expression.

{code}
select area, sum(product) over () as c from windowData
where product > 3 group by area, product
having avg(month) > 0 order by avg(month), product
{code}

In this case the analytic function sum is computed over all the rows of the 
result set.

Currently it's not allowed through the Dataset API.
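
For context, a sketch of the SQL form and what the Dataset API requires today (untested 
here; it assumes a spark-shell session and the {{windowData}} temp view above, and the 
no-argument {{over()}} at the end is what this ticket asks for, it does not exist yet):

{code}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum
import spark.implicits._   // spark-shell session assumed

// SQL: the empty OVER () evaluates the aggregate over the whole result set.
spark.sql("""
  select area, sum(product) over () as c from windowData
  where product > 3 group by area, product
  having avg(month) > 0 order by avg(month), product
""").show()

// Dataset API, simplified: one possible workaround today is an explicitly
// empty window spec (untested sketch); the ticket asks for a no-arg over().
val df = spark.table("windowData")
df.select($"area", sum($"product").over(Window.partitionBy()).as("c")).show()
{code}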




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16195) Allow users to specify empty over clause in window expressions through dataset API

2016-06-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16195:


Assignee: (was: Apache Spark)

> Allow users to specify empty over clause in window expressions through 
> dataset API
> --
>
> Key: SPARK-16195
> URL: https://issues.apache.org/jira/browse/SPARK-16195
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Dilip Biswal
>Priority: Minor
>
> In SQL, it's allowed to specify an empty OVER clause in the window expression.
> select area, sum(product) over () as c from windowData
> where product > 3 group by area, product
> having avg(month) > 0 order by avg(month), product
> In this case the analytic function sum is computed over all the rows of 
> the result set.
> Currently it's not allowed through the Dataset API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16195) Allow users to specify empty over clause in window expressions through dataset API

2016-06-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16195:


Assignee: Apache Spark

> Allow users to specify empty over clause in window expressions through 
> dataset API
> --
>
> Key: SPARK-16195
> URL: https://issues.apache.org/jira/browse/SPARK-16195
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Dilip Biswal
>Assignee: Apache Spark
>Priority: Minor
>
> In SQL, it's allowed to specify an empty OVER clause in the window expression.
> select area, sum(product) over () as c from windowData
> where product > 3 group by area, product
> having avg(month) > 0 order by avg(month), product
> In this case the analytic function sum is computed over all the rows of 
> the result set.
> Currently it's not allowed through the Dataset API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16195) Allow users to specify empty over clause in window expressions through dataset API

2016-06-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15348628#comment-15348628
 ] 

Apache Spark commented on SPARK-16195:
--

User 'dilipbiswal' has created a pull request for this issue:
https://github.com/apache/spark/pull/13897

> Allow users to specify empty over clause in window expressions through 
> dataset API
> --
>
> Key: SPARK-16195
> URL: https://issues.apache.org/jira/browse/SPARK-16195
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Dilip Biswal
>Priority: Minor
>
> In SQL, it's allowed to specify an empty OVER clause in the window expression.
> select area, sum(product) over () as c from windowData
> where product > 3 group by area, product
> having avg(month) > 0 order by avg(month), product
> In this case the analytic function sum is computed over all the rows of 
> the result set.
> Currently it's not allowed through the Dataset API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16194) No way to dynamically set env vars on driver in cluster mode

2016-06-24 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15348643#comment-15348643
 ] 

Marcelo Vanzin commented on SPARK-16194:


For YARN you have {{spark.yarn.appMasterEnv.ENV_VAR}} just like for the 
executor env. Not sure if other cluster managers have anything like that.
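
For reference, a sketch of the existing knobs (documented conf keys only; the driver-side 
key at the end is hypothetical and is what this ticket asks for):

{code}
// Executor env vars work on all cluster managers; the AM/driver env key is YARN-only.
// In cluster mode these are typically passed via spark-submit --conf so they take
// effect before the driver starts.
val conf = new org.apache.spark.SparkConf()
  .set("spark.executorEnv.SSL_ENABLED", "true")        // documented: spark.executorEnv.[ENV]
  .set("spark.yarn.appMasterEnv.SSL_ENABLED", "true")  // documented: spark.yarn.appMasterEnv.[ENV]
// Requested here: a cluster-manager-agnostic equivalent for the driver,
// e.g. a hypothetical "spark.driverEnv.SSL_ENABLED".
{code}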

> No way to dynamically set env vars on driver in cluster mode
> 
>
> Key: SPARK-16194
> URL: https://issues.apache.org/jira/browse/SPARK-16194
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Michael Gummelt
>Priority: Minor
>
> I often need to dynamically configure a driver when submitting in cluster 
> mode, but there's currently no way of setting env vars.  {{spark-env.sh}} 
> lets me set env vars, but I have to statically build that into my spark 
> distribution.  I need a solution for specifying them in {{spark-submit}}.  
> Much like {{spark.executorEnv.[ENV]}}, but for drivers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16194) No way to dynamically set env vars on driver in cluster mode

2016-06-24 Thread Michael Gummelt (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15348645#comment-15348645
 ] 

Michael Gummelt commented on SPARK-16194:
-

> Env variables are pretty much from outside Spark right?

They're my own env vars, yeah.  The motivating case is setting "SSL_ENABLED" on 
the driver to enable Mesos SSL support.

> Generally, these are being removed and deprecated anyway.

You mean the Spark env vars like SPARK_SUBMIT_OPTS?  That's good to hear, but 
that's not what I'm talking about.

> Any chance of just using a sys property or command line alternative?

libmesos ultimately needs SSL_ENABLED, so every Spark job I submit would have 
to convert from the sys property to the env var, which is infeasible.

I realize this may be a corner case, but it would make us consistent with 
spark.executorEnv.[ENV].

> No way to dynamically set env vars on driver in cluster mode
> 
>
> Key: SPARK-16194
> URL: https://issues.apache.org/jira/browse/SPARK-16194
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Michael Gummelt
>Priority: Minor
>
> I often need to dynamically configure a driver when submitting in cluster 
> mode, but there's currently no way of setting env vars.  {{spark-env.sh}} 
> lets me set env vars, but I have to statically build that into my spark 
> distribution.  I need a solution for specifying them in {{spark-submit}}.  
> Much like {{spark.executorEnv.[ENV]}}, but for drivers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16196) Optimize in-memory scan performance using ColumnarBatches

2016-06-24 Thread Andrew Or (JIRA)
Andrew Or created SPARK-16196:
-

 Summary: Optimize in-memory scan performance using ColumnarBatches
 Key: SPARK-16196
 URL: https://issues.apache.org/jira/browse/SPARK-16196
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Andrew Or
Assignee: Andrew Or


A simple benchmark such as the following reveals inefficiencies in the existing 
in-memory scan implementation:
{code}
spark.range(N)
  .selectExpr("id", "floor(rand() * 1) as k")
  .createOrReplaceTempView("test")
val ds = spark.sql("select count(k), count(id) from test").cache()
ds.collect()
ds.collect()
{code}

There are many reasons why caching is slow. The biggest is that compression 
takes a long time. The second is that there are a lot of virtual function calls 
in this hot code path since the rows are processed using iterators. Further, 
the rows are converted to and from ByteBuffers, which are slow to read in 
general.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16197) Cleanup PySpark status api and example

2016-06-24 Thread Bryan Cutler (JIRA)
Bryan Cutler created SPARK-16197:


 Summary: Cleanup PySpark status api and example
 Key: SPARK-16197
 URL: https://issues.apache.org/jira/browse/SPARK-16197
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Reporter: Bryan Cutler
Priority: Trivial


Clean up the Status API to use SparkSession. Also noticed that Status defines 
two empty classes without using 'pass' and two methods that do not explicitly 
return 'None' when the requested info cannot be fetched. These issues do not 
cause any errors, but it is good practice to use 'pass' in empty class 
definitions and to return 'None' when the caller expects a return value.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16194) No way to dynamically set env vars on driver in cluster mode

2016-06-24 Thread Michael Gummelt (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15348650#comment-15348650
 ] 

Michael Gummelt commented on SPARK-16194:
-

Ah, yeah, that's what I need.  I'd like to make this standard.

> No way to dynamically set env vars on driver in cluster mode
> 
>
> Key: SPARK-16194
> URL: https://issues.apache.org/jira/browse/SPARK-16194
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Michael Gummelt
>Priority: Minor
>
> I often need to dynamically configure a driver when submitting in cluster 
> mode, but there's currently no way of setting env vars.  {{spark-env.sh}} 
> lets me set env vars, but I have to statically build that into my spark 
> distribution.  I need a solution for specifying them in {{spark-submit}}.  
> Much like {{spark.executorEnv.[ENV]}}, but for drivers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16197) Cleanup PySpark status api and example

2016-06-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15348679#comment-15348679
 ] 

Apache Spark commented on SPARK-16197:
--

User 'BryanCutler' has created a pull request for this issue:
https://github.com/apache/spark/pull/13898

> Cleanup PySpark status api and example
> --
>
> Key: SPARK-16197
> URL: https://issues.apache.org/jira/browse/SPARK-16197
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Bryan Cutler
>Priority: Trivial
>
> Clean up the Status API to use SparkSession. Also noticed that Status defines 
> two empty classes without using 'pass' and two methods that do not explicitly 
> return 'None' when the requested info cannot be fetched. These issues do not 
> cause any errors, but it is good practice to use 'pass' in empty class 
> definitions and to return 'None' when the caller expects a return value.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16197) Cleanup PySpark status api and example

2016-06-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16197:


Assignee: Apache Spark

> Cleanup PySpark status api and example
> --
>
> Key: SPARK-16197
> URL: https://issues.apache.org/jira/browse/SPARK-16197
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Bryan Cutler
>Assignee: Apache Spark
>Priority: Trivial
>
> Clean up the Status API to use SparkSession. Also noticed that Status defines 
> two empty classes without using 'pass' and two methods that do not explicitly 
> return 'None' when the requested info cannot be fetched. These issues do not 
> cause any errors, but it is good practice to use 'pass' in empty class 
> definitions and to return 'None' when the caller expects a return value.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16197) Cleanup PySpark status api and example

2016-06-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16197:


Assignee: (was: Apache Spark)

> Cleanup PySpark status api and example
> --
>
> Key: SPARK-16197
> URL: https://issues.apache.org/jira/browse/SPARK-16197
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Bryan Cutler
>Priority: Trivial
>
> Clean up the Status API to use SparkSession. Also noticed that Status defines 
> two empty classes without using 'pass' and two methods that do not explicitly 
> return 'None' when the requested info cannot be fetched. These issues do not 
> cause any errors, but it is good practice to use 'pass' in empty class 
> definitions and to return 'None' when the caller expects a return value.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16173) Can't join describe() of DataFrame in Scala 2.10

2016-06-24 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15348688#comment-15348688
 ] 

Dongjoon Hyun commented on SPARK-16173:
---

Of course, with Scala 2.10.

> Can't join describe() of DataFrame in Scala 2.10
> 
>
> Key: SPARK-16173
> URL: https://issues.apache.org/jira/browse/SPARK-16173
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.1, 2.0.0
>Reporter: Davies Liu
>
> describe() of DataFrame uses Seq() (it's actually an Iterator) to create 
> another DataFrame, which cannot be serialized in Scala 2.10.
> {code}
> org.apache.spark.SparkException: Task not serializable
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
>   at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2060)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:707)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:706)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
>   at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:706)
>   at 
> org.apache.spark.sql.execution.ConvertToUnsafe.doExecute(rowFormatConverters.scala:38)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$broadcastFuture$1$$anonfun$apply$1.apply(BroadcastHashJoin.scala:82)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$broadcastFuture$1$$anonfun$apply$1.apply(BroadcastHashJoin.scala:79)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withExecutionId(SQLExecution.scala:100)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$broadcastFuture$1.apply(BroadcastHashJoin.scala:79)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$broadcastFuture$1.apply(BroadcastHashJoin.scala:79)
>   at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
>   at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.NotSerializableException: 
> scala.collection.Iterator$$anon$11
> Serialization stack:
>   - object not serializable (class: scala.collection.Iterator$$anon$11, 
> value: empty iterator)
>   - field (class: scala.collection.Iterator$$anonfun$toStream$1, name: 
> $outer, type: interface scala.collection.Iterator)
>   - object (class scala.collection.Iterator$$anonfun$toStream$1, 
> )
>   - field (class: scala.collection.immutable.Stream$Cons, name: tl, type: 
> interface scala.Function0)
>   - object (class scala.collection.immutable.Stream$Cons, 
> Stream(WrappedArray(1), WrappedArray(2.0), WrappedArray(NaN), 
> WrappedArray(2), WrappedArray(2)))
>   - field (class: scala.collection.immutable.Stream$$anonfun$zip$1, name: 
> $outer, type: class scala.collection.immutable.Stream)
>   - object (class scala.collection.immutable.Stream$$anonfun$zip$1, 
> )
>   - field (class: scala.collection.immutable.Stream$Cons, name: tl, type: 
> interface scala.Function0)
>   - object (class scala.collection.immutable.Stream$Cons, 
> Stream((WrappedArray(1),(count,)), 
> (WrappedArray(2.0),(mean,)), 
> (WrappedArray(NaN),(stddev,)), 
> (WrappedArray(2),(min,)), (WrappedArray(2),(max,
>   - field (class: scala.collection.immutable.Stream$$anonfun$map$1, name: 
> $outer, type: class scala.collection.immutable.Stream)
>   - object (class scala.collection.immutable.Stream$$anonfun$map$1, 
> )
>   - field (class: scala.collection.immutable.Stream$Cons, name: tl, type: 
> interface scala.Function0)
>   - object (class scala.collection.immutable.Stream$Cons, 
> Stream([count,1], [mean,2.0], [stddev,NaN], [min,2], [max,2]))
>   - field (cl

[jira] [Commented] (SPARK-16173) Can't join describe() of DataFrame in Scala 2.10

2016-06-24 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15348685#comment-15348685
 ] 

Dongjoon Hyun commented on SPARK-16173:
---

Hi, [~davies] and [~bomeng].
If you don't mind, I'll make a PR soon.
I resolved the problem and tested it in both the Scala and Python shells.
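
The underlying issue can be reproduced outside Spark with a standalone sketch (plain 
Scala 2.10, not the describe() code itself): a lazily built Stream keeps the source 
Iterator alive in its tail, and that iterator is not serializable; forcing to a strict 
collection avoids it.

{code}
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

def serialize(o: AnyRef): Unit =
  new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(o)

val it = Iterator(1, 2, 3)
val lazySeq: Seq[Int] = it.toStream.map(_ + 1)  // lazy tail closes over the iterator

// serialize(lazySeq)     // java.io.NotSerializableException: scala.collection.Iterator$$anon...
serialize(lazySeq.toList) // forcing to a strict List serializes fine
{code}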

> Can't join describe() of DataFrame in Scala 2.10
> 
>
> Key: SPARK-16173
> URL: https://issues.apache.org/jira/browse/SPARK-16173
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.1, 2.0.0
>Reporter: Davies Liu
>
> describe() of DataFrame uses Seq() (it's actually an Iterator) to create 
> another DataFrame, which cannot be serialized in Scala 2.10.
> {code}
> org.apache.spark.SparkException: Task not serializable
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
>   at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2060)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:707)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:706)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
>   at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:706)
>   at 
> org.apache.spark.sql.execution.ConvertToUnsafe.doExecute(rowFormatConverters.scala:38)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$broadcastFuture$1$$anonfun$apply$1.apply(BroadcastHashJoin.scala:82)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$broadcastFuture$1$$anonfun$apply$1.apply(BroadcastHashJoin.scala:79)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withExecutionId(SQLExecution.scala:100)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$broadcastFuture$1.apply(BroadcastHashJoin.scala:79)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$broadcastFuture$1.apply(BroadcastHashJoin.scala:79)
>   at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
>   at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.NotSerializableException: 
> scala.collection.Iterator$$anon$11
> Serialization stack:
>   - object not serializable (class: scala.collection.Iterator$$anon$11, 
> value: empty iterator)
>   - field (class: scala.collection.Iterator$$anonfun$toStream$1, name: 
> $outer, type: interface scala.collection.Iterator)
>   - object (class scala.collection.Iterator$$anonfun$toStream$1, 
> )
>   - field (class: scala.collection.immutable.Stream$Cons, name: tl, type: 
> interface scala.Function0)
>   - object (class scala.collection.immutable.Stream$Cons, 
> Stream(WrappedArray(1), WrappedArray(2.0), WrappedArray(NaN), 
> WrappedArray(2), WrappedArray(2)))
>   - field (class: scala.collection.immutable.Stream$$anonfun$zip$1, name: 
> $outer, type: class scala.collection.immutable.Stream)
>   - object (class scala.collection.immutable.Stream$$anonfun$zip$1, 
> )
>   - field (class: scala.collection.immutable.Stream$Cons, name: tl, type: 
> interface scala.Function0)
>   - object (class scala.collection.immutable.Stream$Cons, 
> Stream((WrappedArray(1),(count,)), 
> (WrappedArray(2.0),(mean,)), 
> (WrappedArray(NaN),(stddev,)), 
> (WrappedArray(2),(min,)), (WrappedArray(2),(max,
>   - field (class: scala.collection.immutable.Stream$$anonfun$map$1, name: 
> $outer, type: class scala.collection.immutable.Stream)
>   - object (class scala.collection.immutable.Stream$$anonfun$map$1, 
> )
>   - field (class: scala.collection.immutable.Stream$Cons, name: tl, type: 
> interface scala.Function0)
>   - object (class scala.collection.i

[jira] [Assigned] (SPARK-16196) Optimize in-memory scan performance using ColumnarBatches

2016-06-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16196:


Assignee: Apache Spark  (was: Andrew Or)

> Optimize in-memory scan performance using ColumnarBatches
> -
>
> Key: SPARK-16196
> URL: https://issues.apache.org/jira/browse/SPARK-16196
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Apache Spark
>
> A simple benchmark such as the following reveals inefficiencies in the 
> existing in-memory scan implementation:
> {code}
> spark.range(N)
>   .selectExpr("id", "floor(rand() * 1) as k")
>   .createOrReplaceTempView("test")
> val ds = spark.sql("select count(k), count(id) from test").cache()
> ds.collect()
> ds.collect()
> {code}
> There are many reasons why caching is slow. The biggest is that compression 
> takes a long time. The second is that there are a lot of virtual function 
> calls in this hot code path since the rows are processed using iterators. 
> Further, the rows are converted to and from ByteBuffers, which are slow to 
> read in general.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16196) Optimize in-memory scan performance using ColumnarBatches

2016-06-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15348703#comment-15348703
 ] 

Apache Spark commented on SPARK-16196:
--

User 'andrewor14' has created a pull request for this issue:
https://github.com/apache/spark/pull/13899

> Optimize in-memory scan performance using ColumnarBatches
> -
>
> Key: SPARK-16196
> URL: https://issues.apache.org/jira/browse/SPARK-16196
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> A simple benchmark such as the following reveals inefficiencies in the 
> existing in-memory scan implementation:
> {code}
> spark.range(N)
>   .selectExpr("id", "floor(rand() * 1) as k")
>   .createOrReplaceTempView("test")
> val ds = spark.sql("select count(k), count(id) from test").cache()
> ds.collect()
> ds.collect()
> {code}
> There are many reasons why caching is slow. The biggest is that compression 
> takes a long time. The second is that there are a lot of virtual function 
> calls in this hot code path since the rows are processed using iterators. 
> Further, the rows are converted to and from ByteBuffers, which are slow to 
> read in general.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16196) Optimize in-memory scan performance using ColumnarBatches

2016-06-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16196:


Assignee: Andrew Or  (was: Apache Spark)

> Optimize in-memory scan performance using ColumnarBatches
> -
>
> Key: SPARK-16196
> URL: https://issues.apache.org/jira/browse/SPARK-16196
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> A simple benchmark such as the following reveals inefficiencies in the 
> existing in-memory scan implementation:
> {code}
> spark.range(N)
>   .selectExpr("id", "floor(rand() * 1) as k")
>   .createOrReplaceTempView("test")
> val ds = spark.sql("select count(k), count(id) from test").cache()
> ds.collect()
> ds.collect()
> {code}
> There are many reasons why caching is slow. The biggest is that compression 
> takes a long time. The second is that there are a lot of virtual function 
> calls in this hot code path since the rows are processed using iterators. 
> Further, the rows are converted to and from ByteBuffers, which are slow to 
> read in general.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16077) Python UDF may fail because of six

2016-06-24 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reassigned SPARK-16077:
--

Assignee: Davies Liu

> Python UDF may fail because of six
> --
>
> Key: SPARK-16077
> URL: https://issues.apache.org/jira/browse/SPARK-16077
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 1.6.3, 2.0.1
>
>
> six or other package may break pickle.whichmodule() in pickle:
> https://bitbucket.org/gutworth/six/issues/63/importing-six-breaks-pickling



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16077) Python UDF may fail because of six

2016-06-24 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-16077.

   Resolution: Fixed
Fix Version/s: 2.0.1
   1.6.3

Issue resolved by pull request 13788
[https://github.com/apache/spark/pull/13788]

> Python UDF may fail because of six
> --
>
> Key: SPARK-16077
> URL: https://issues.apache.org/jira/browse/SPARK-16077
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Davies Liu
> Fix For: 1.6.3, 2.0.1
>
>
> six or other package may break pickle.whichmodule() in pickle:
> https://bitbucket.org/gutworth/six/issues/63/importing-six-breaks-pickling



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16173) Can't join describe() of DataFrame in Scala 2.10

2016-06-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15348713#comment-15348713
 ] 

Apache Spark commented on SPARK-16173:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/13900

> Can't join describe() of DataFrame in Scala 2.10
> 
>
> Key: SPARK-16173
> URL: https://issues.apache.org/jira/browse/SPARK-16173
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.1, 2.0.0
>Reporter: Davies Liu
>
> describe() of DataFrame uses Seq() (it's actually an Iterator) to create 
> another DataFrame, which cannot be serialized in Scala 2.10.
> {code}
> org.apache.spark.SparkException: Task not serializable
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
>   at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2060)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:707)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:706)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
>   at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:706)
>   at 
> org.apache.spark.sql.execution.ConvertToUnsafe.doExecute(rowFormatConverters.scala:38)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$broadcastFuture$1$$anonfun$apply$1.apply(BroadcastHashJoin.scala:82)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$broadcastFuture$1$$anonfun$apply$1.apply(BroadcastHashJoin.scala:79)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withExecutionId(SQLExecution.scala:100)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$broadcastFuture$1.apply(BroadcastHashJoin.scala:79)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$broadcastFuture$1.apply(BroadcastHashJoin.scala:79)
>   at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
>   at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.NotSerializableException: 
> scala.collection.Iterator$$anon$11
> Serialization stack:
>   - object not serializable (class: scala.collection.Iterator$$anon$11, 
> value: empty iterator)
>   - field (class: scala.collection.Iterator$$anonfun$toStream$1, name: 
> $outer, type: interface scala.collection.Iterator)
>   - object (class scala.collection.Iterator$$anonfun$toStream$1, 
> )
>   - field (class: scala.collection.immutable.Stream$Cons, name: tl, type: 
> interface scala.Function0)
>   - object (class scala.collection.immutable.Stream$Cons, 
> Stream(WrappedArray(1), WrappedArray(2.0), WrappedArray(NaN), 
> WrappedArray(2), WrappedArray(2)))
>   - field (class: scala.collection.immutable.Stream$$anonfun$zip$1, name: 
> $outer, type: class scala.collection.immutable.Stream)
>   - object (class scala.collection.immutable.Stream$$anonfun$zip$1, 
> )
>   - field (class: scala.collection.immutable.Stream$Cons, name: tl, type: 
> interface scala.Function0)
>   - object (class scala.collection.immutable.Stream$Cons, 
> Stream((WrappedArray(1),(count,)), 
> (WrappedArray(2.0),(mean,)), 
> (WrappedArray(NaN),(stddev,)), 
> (WrappedArray(2),(min,)), (WrappedArray(2),(max,
>   - field (class: scala.collection.immutable.Stream$$anonfun$map$1, name: 
> $outer, type: class scala.collection.immutable.Stream)
>   - object (class scala.collection.immutable.Stream$$anonfun$map$1, 
> )
>   - field (class: scala.collection.immutable.Stream$Cons, name: tl, type: 
> interface scala.Function0)
>   - object (class scala.collection.immutable.Stream$Cons, 
> Strea
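
A minimal PySpark sketch of the kind of call that can trip this (an illustration, not taken from the report); the failure should only appear on a Scala 2.10 build, where the Stream built inside describe() cannot be serialized:

{code}
# Hypothetical repro, assuming a 1.6-style sqlContext on a Scala 2.10 build.
df = sqlContext.createDataFrame([(1, 2.0), (3, 4.0)], ["a", "b"])
left = df.describe("a")    # columns: summary, a
right = df.describe("b")   # columns: summary, b

# Joining the two summaries forces the plan (and the Stream captured by
# describe()) through serialization -> "Task not serializable".
left.join(right, "summary").show()
{code}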

[jira] [Assigned] (SPARK-16173) Can't join describe() of DataFrame in Scala 2.10

2016-06-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16173:


Assignee: (was: Apache Spark)

> Can't join describe() of DataFrame in Scala 2.10
> 
>
> Key: SPARK-16173
> URL: https://issues.apache.org/jira/browse/SPARK-16173
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.1, 2.0.0
>Reporter: Davies Liu
>
> describe() of DataFrame uses a Seq() (actually an Iterator) to create 
> another DataFrame, which cannot be serialized in Scala 2.10.
> {code}
> org.apache.spark.SparkException: Task not serializable
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
>   at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2060)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:707)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:706)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
>   at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:706)
>   at 
> org.apache.spark.sql.execution.ConvertToUnsafe.doExecute(rowFormatConverters.scala:38)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$broadcastFuture$1$$anonfun$apply$1.apply(BroadcastHashJoin.scala:82)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$broadcastFuture$1$$anonfun$apply$1.apply(BroadcastHashJoin.scala:79)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withExecutionId(SQLExecution.scala:100)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$broadcastFuture$1.apply(BroadcastHashJoin.scala:79)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$broadcastFuture$1.apply(BroadcastHashJoin.scala:79)
>   at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
>   at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.NotSerializableException: 
> scala.collection.Iterator$$anon$11
> Serialization stack:
>   - object not serializable (class: scala.collection.Iterator$$anon$11, 
> value: empty iterator)
>   - field (class: scala.collection.Iterator$$anonfun$toStream$1, name: 
> $outer, type: interface scala.collection.Iterator)
>   - object (class scala.collection.Iterator$$anonfun$toStream$1, 
> )
>   - field (class: scala.collection.immutable.Stream$Cons, name: tl, type: 
> interface scala.Function0)
>   - object (class scala.collection.immutable.Stream$Cons, 
> Stream(WrappedArray(1), WrappedArray(2.0), WrappedArray(NaN), 
> WrappedArray(2), WrappedArray(2)))
>   - field (class: scala.collection.immutable.Stream$$anonfun$zip$1, name: 
> $outer, type: class scala.collection.immutable.Stream)
>   - object (class scala.collection.immutable.Stream$$anonfun$zip$1, 
> )
>   - field (class: scala.collection.immutable.Stream$Cons, name: tl, type: 
> interface scala.Function0)
>   - object (class scala.collection.immutable.Stream$Cons, 
> Stream((WrappedArray(1),(count,)), 
> (WrappedArray(2.0),(mean,)), 
> (WrappedArray(NaN),(stddev,)), 
> (WrappedArray(2),(min,)), (WrappedArray(2),(max,
>   - field (class: scala.collection.immutable.Stream$$anonfun$map$1, name: 
> $outer, type: class scala.collection.immutable.Stream)
>   - object (class scala.collection.immutable.Stream$$anonfun$map$1, 
> )
>   - field (class: scala.collection.immutable.Stream$Cons, name: tl, type: 
> interface scala.Function0)
>   - object (class scala.collection.immutable.Stream$Cons, 
> Stream([count,1], [mean,2.0], [stddev,NaN], [min,2], [max,2]))
>   - field (class: scala.collection.immutable.Stream$$ano

[jira] [Assigned] (SPARK-16173) Can't join describe() of DataFrame in Scala 2.10

2016-06-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16173:


Assignee: Apache Spark

> Can't join describe() of DataFrame in Scala 2.10
> 
>
> Key: SPARK-16173
> URL: https://issues.apache.org/jira/browse/SPARK-16173
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.1, 2.0.0
>Reporter: Davies Liu
>Assignee: Apache Spark
>
> describe() of DataFrame uses a Seq() (actually an Iterator) to create 
> another DataFrame, which cannot be serialized in Scala 2.10.
> {code}
> org.apache.spark.SparkException: Task not serializable
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
>   at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2060)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:707)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:706)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
>   at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:706)
>   at 
> org.apache.spark.sql.execution.ConvertToUnsafe.doExecute(rowFormatConverters.scala:38)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$broadcastFuture$1$$anonfun$apply$1.apply(BroadcastHashJoin.scala:82)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$broadcastFuture$1$$anonfun$apply$1.apply(BroadcastHashJoin.scala:79)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withExecutionId(SQLExecution.scala:100)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$broadcastFuture$1.apply(BroadcastHashJoin.scala:79)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$broadcastFuture$1.apply(BroadcastHashJoin.scala:79)
>   at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
>   at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.NotSerializableException: 
> scala.collection.Iterator$$anon$11
> Serialization stack:
>   - object not serializable (class: scala.collection.Iterator$$anon$11, 
> value: empty iterator)
>   - field (class: scala.collection.Iterator$$anonfun$toStream$1, name: 
> $outer, type: interface scala.collection.Iterator)
>   - object (class scala.collection.Iterator$$anonfun$toStream$1, 
> )
>   - field (class: scala.collection.immutable.Stream$Cons, name: tl, type: 
> interface scala.Function0)
>   - object (class scala.collection.immutable.Stream$Cons, 
> Stream(WrappedArray(1), WrappedArray(2.0), WrappedArray(NaN), 
> WrappedArray(2), WrappedArray(2)))
>   - field (class: scala.collection.immutable.Stream$$anonfun$zip$1, name: 
> $outer, type: class scala.collection.immutable.Stream)
>   - object (class scala.collection.immutable.Stream$$anonfun$zip$1, 
> )
>   - field (class: scala.collection.immutable.Stream$Cons, name: tl, type: 
> interface scala.Function0)
>   - object (class scala.collection.immutable.Stream$Cons, 
> Stream((WrappedArray(1),(count,)), 
> (WrappedArray(2.0),(mean,)), 
> (WrappedArray(NaN),(stddev,)), 
> (WrappedArray(2),(min,)), (WrappedArray(2),(max,
>   - field (class: scala.collection.immutable.Stream$$anonfun$map$1, name: 
> $outer, type: class scala.collection.immutable.Stream)
>   - object (class scala.collection.immutable.Stream$$anonfun$map$1, 
> )
>   - field (class: scala.collection.immutable.Stream$Cons, name: tl, type: 
> interface scala.Function0)
>   - object (class scala.collection.immutable.Stream$Cons, 
> Stream([count,1], [mean,2.0], [stddev,NaN], [min,2], [max,2]))
>   - field (class: scala.collect

[jira] [Updated] (SPARK-16195) Allow users to specify empty over clause in window expressions through dataset API

2016-06-24 Thread Dilip Biswal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dilip Biswal updated SPARK-16195:
-
Description: 
In SQL, it is allowed to specify an empty OVER clause in a window expression.

{code}
select area, sum(product) over () as c from windowData
where product > 3 group by area, product
having avg(month) > 0 order by avg(month), product
{code}

In this case the analytic function sum is computed over all the rows of 
the result set.

Currently this is not allowed through the Dataset API.


  was:
In SQL, its allowed to specify an empty OVER clause in the window expression.
 
select area, sum(product) over () as c from windowData
where product > 3 group by area, product
having avg(month) > 0 order by avg(month), product

In this case the analytic function sum is presented based on all the rows of 
the result set

Currently its not allowed through dataset API.



> Allow users to specify empty over clause in window expressions through 
> dataset API
> --
>
> Key: SPARK-16195
> URL: https://issues.apache.org/jira/browse/SPARK-16195
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Dilip Biswal
>Priority: Minor
>
> In SQL, it is allowed to specify an empty OVER clause in a window expression.
> {code}
> select area, sum(product) over () as c from windowData
> where product > 3 group by area, product
> having avg(month) > 0 order by avg(month), product
> {code}
> In this case the analytic function sum is computed over all the rows of 
> the result set.
> Currently this is not allowed through the Dataset API.
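
As an illustration only (not part of the proposal): in today's PySpark API, an explicit window spec with no partitioning or ordering plays the role of the empty OVER () clause. A simplified sketch (dropping the GROUP BY/HAVING from the SQL above, and assuming the windowData table exists):

{code}
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Empty window spec: no partitionBy columns, no orderBy -> one global frame.
w = Window.partitionBy()

windowData = sqlContext.table("windowData")
windowData.where(windowData.product > 3) \
    .select("area", "product", F.sum("product").over(w).alias("c")) \
    .show()
{code}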



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16183) Large Spark SQL commands cause StackOverflowError in parser when using sqlContext.sql

2016-06-24 Thread Matthew Porter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15348726#comment-15348726
 ] 

Matthew Porter commented on SPARK-16183:


The query has a bit of proprietary information in it, so I will work on making a 
sample query and try to get back to you soon.

Now that I think about it, this isn't the first time I have encountered this 
issue. A few months ago I wrote another SparkSQL-generating script which created 
significantly smaller (although still large) queries containing many 'WHERE 
x.pos BETWEEN y AND z' clauses. The generated queries worked for hundreds of 
inputs, all except one that required a SQL query larger than the rest. That 
query involved several hundred BETWEEN statements and resulted in the same 
recursive parser error message.

It should be noted that the SQL query I can't post here has only one or two 
WHERE clauses; the bulk of it is INNER JOINs plus column selections and 
calculations. Two very different use cases result in the same error message.
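
A synthetic query along these lines (a made-up stand-in, not the reporter's actual query) should be enough to hit the same parser recursion on 1.6, possibly with a larger clause count depending on driver stack size:

{code}
# Hypothetical stand-in for the auto-generated SQL; table/column names are made up.
clauses = " OR ".join(
    "x.pos BETWEEN {0} AND {1}".format(i * 100, i * 100 + 50)
    for i in range(500)
)
cmd = "SELECT x.* FROM intervals x WHERE " + clauses

merged_df = sqlContext.sql(cmd)   # java.lang.StackOverflowError in the 1.6 parser
{code}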

> Large Spark SQL commands cause StackOverflowError in parser when using 
> sqlContext.sql
> -
>
> Key: SPARK-16183
> URL: https://issues.apache.org/jira/browse/SPARK-16183
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.6.1
> Environment: Running on AWS EMR
>Reporter: Matthew Porter
>
> Hi,
> I have created a PySpark SQL-based tool which auto-generates a complex SQL 
> command to be run via sqlContext.sql(cmd) based on a large number of 
> parameters. As the number of input files to be filtered and joined in this 
> query grows, so does the length of the SQL query. The tool runs fine up until 
> about 200+ files are included in the join, at which point the SQL command 
> becomes very long (~100K characters). It is only on these longer queries that 
> Spark fails, throwing an exception due to what seems to be too much recursion 
> occurring within the SparkSQL parser:
> {code}
> Traceback (most recent call last):
> ...
> merged_df = sqlsc.sql(cmd)
>   File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/context.py", line 
> 580, in sql
>   File "/usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", 
> line 813, in __call__
>   File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 45, 
> in deco
>   File "/usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 
> 308, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o173.sql.
> : java.lang.StackOverflowError
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers

[jira] [Created] (SPARK-16198) Change the access level of the predict method in spark.ml.Predictor to public

2016-06-24 Thread Hussein Hazimeh (JIRA)
Hussein Hazimeh created SPARK-16198:
---

 Summary: Change the access level of the predict method in 
spark.ml.Predictor to public
 Key: SPARK-16198
 URL: https://issues.apache.org/jira/browse/SPARK-16198
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Hussein Hazimeh


h1. Summary

The transform method of predictors in spark.ml has a relatively high latency 
when predicting single instances or small batches, which is mainly due to the 
overhead introduced by DataFrame operations. For a text classification task on 
the RCV1 dataset, changing the access level of the low-level "predict" method 
from protected to public and using it to make predictions reduced the latency 
of single predictions by a factor of three to four and that of batches by 50%. 
While the transform method is flexible and sufficient for general usage, 
exposing the low-level predict method in the public API can benefit many 
applications that require low-latency responses.

h1. Experiment
I performed an experiment to measure the latency of single-instance predictions 
in Spark and some other popular ML toolkits. Specifically, I'm looking at the 
time it takes to predict or classify a feature vector residing in memory after 
the model is trained.

For each toolkit in the table below, logistic regression was trained on the 
Reuters RCV1 dataset which contains 697,641 documents and 47,236 features 
stored in LIBSVM format along with binary labels. Then the wall-clock time 
required to classify each document in a sample of 100,000 documents is 
measured, and the 50th, 90th, and 99th percentiles and the maximum time are 
reported. 

All toolkits were tested on a desktop machine with an i7-6700 processor and 16 
GB memory, running Ubuntu 14.04 and OpenBLAS. The wall clock resolution is 80ns 
for Python and 20ns for Scala.
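
A sketch of the spark.mllib leg of this setup (PySpark 1.6 mllib API; the RCV1 path is a placeholder):

{code}
from pyspark.mllib.util import MLUtils
from pyspark.mllib.classification import LogisticRegressionWithLBFGS

# RCV1 in LIBSVM format: 697,641 documents, 47,236 features, binary labels.
data = MLUtils.loadLibSVMFile(sc, "/data/rcv1_train.binary")
model = LogisticRegressionWithLBFGS.train(data, iterations=100)
{code}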

h1. Results
The table below shows the latency of predictions for single instances in 
milliseconds, sorted by P90. Spark and Spark 2 refer to versions 1.6.1 and 
2.0.0-SNAPSHOT (on master), respectively. In {color:blue}Spark 
(Modified){color} and {color:blue}Spark 2 (Modified){color},  I changed the 
access level of the predict method from protected to public and used it to 
perform the predictions instead of transform. 

||Toolkit||API||P50||P90||P99||Max||
|Spark|MLLIB (Scala)|0.0002|0.0015|0.0028|0.0685|
|{color:blue}Spark 2 (Modified){color}|{color:blue}ML (Scala){color}|0.0004|0.0031|0.0087|0.3979|
|{color:blue}Spark (Modified){color}|{color:blue}ML (Scala){color}|0.0013|0.0061|0.0632|0.4924|
|Spark|MLLIB (Python)|0.0065|0.0075|0.0212|0.1579|
|Scikit-Learn|Python|0.0341|0.0460|0.0849|0.2285|
|LIBLINEAR|Python|0.0669|0.1484|0.2623|1.7322|
|{color:red}Spark{color}|{color:red}ML (Scala){color}|2.3713|2.9577|4.6521|511.2448|
|{color:red}Spark 2{color}|{color:red}ML (Scala){color}|8.4603|9.4352|13.2143|292.8733|
|BIDMach (CPU)|Scala|5.4061|49.1362|102.2563|12040.6773|
|BIDMach (GPU)|Scala|471.3460|477.8214|485.9805|807.4782|

The results show that spark.mllib has the lowest latency of all the toolkits 
and APIs tested, which can be attributed to its low-level prediction function 
that operates directly on the feature vector. However, spark.ml has a 
relatively high latency, on the order of 3 ms for Spark 1.6.1 and 10 ms for 
Spark 2.0.0. Profiling the transform method of logistic regression in spark.ml 
showed that only 0.01% of the time is spent on the dot product and logit 
transformation, while the rest is dominated by DataFrame operations (mostly the 
“withColumn” operation that appends the prediction column(s) to the input 
DataFrame). The results of the modified versions of spark.ml, which use the 
predict method directly, validate this observation: the latency is reduced by a 
factor of three to four.
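
A rough sketch of the two prediction paths being compared (PySpark 1.6-era API; the model names are placeholders for an already-trained mllib model and ml model):

{code}
import time
from pyspark.mllib.linalg import Vectors

v = Vectors.sparse(47236, [(0, 1.0), (10, 2.5)])   # one in-memory feature vector

# Low-level path: spark.mllib exposes predict() on the model directly.
t0 = time.time()
mllib_model.predict(v)
print("mllib predict: %.4f ms" % ((time.time() - t0) * 1000))

# DataFrame path: spark.ml goes through a one-row DataFrame and transform().
row = sqlContext.createDataFrame([(v,)], ["features"])
t0 = time.time()
ml_model.transform(row).select("prediction").collect()
print("ml transform:  %.4f ms" % ((time.time() - t0) * 1000))
{code}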

Since Spark splits batch predictions into a series of single-instance 
predictions, reducing the latency of single predictions can also lead to lower 
latencies for batch predictions. I tried batch predictions in spark.ml (1.6.1) 
using testing_features.map(x => model.predict(x)).collect() instead of 
model.transform(testing_dataframe).select(“prediction”).collect(), and the 
former had roughly 50% lower latency for batches of size 1,000, 10,000, and 
100,000.

Although the experiment is constrained to logistic regression, other predictors 
in the classification, regression, and clustering modules can suffer from the 
same problem, since it is caused by DataFrame overhead rather than by the model 
itself. Therefore, changing the access level of the predict method in all 
predictors to public can benefit applications that require low latency and give 
programmers more flexibility.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsu

[jira] [Resolved] (SPARK-16179) UDF explosion yielding empty dataframe fails

2016-06-24 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-16179.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

> UDF explosion yielding empty dataframe fails
> 
>
> Key: SPARK-16179
> URL: https://issues.apache.org/jira/browse/SPARK-16179
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.0
>Reporter: Vladimir Feinberg
>Assignee: Davies Liu
> Fix For: 2.0.0
>
>
> Command to replicate 
> https://gist.github.com/vlad17/cff2bab81929f44556a364ee90981ac0
> Resulting failure
> https://gist.github.com/vlad17/964c0a93510d79cb130c33700f6139b7



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16179) UDF explosion yielding empty dataframe fails

2016-06-24 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-16179:

Fix Version/s: (was: 2.0.0)
   2.0.1

> UDF explosion yielding empty dataframe fails
> 
>
> Key: SPARK-16179
> URL: https://issues.apache.org/jira/browse/SPARK-16179
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.0
>Reporter: Vladimir Feinberg
>Assignee: Davies Liu
> Fix For: 2.0.1
>
>
> Command to replicate 
> https://gist.github.com/vlad17/cff2bab81929f44556a364ee90981ac0
> Resulting failure
> https://gist.github.com/vlad17/964c0a93510d79cb130c33700f6139b7



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16199) Add a method to list the referenced columns in data source Filter

2016-06-24 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-16199:
---

 Summary: Add a method to list the referenced columns in data 
source Filter
 Key: SPARK-16199
 URL: https://issues.apache.org/jira/browse/SPARK-16199
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin


It would be useful to support listing the columns that are referenced by a 
filter.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


