[jira] [Resolved] (SPARK-11305) Remove Third-Party Hadoop Distributions Doc Page

2015-11-01 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-11305.
---
   Resolution: Fixed
 Assignee: Sean Owen
Fix Version/s: 1.6.0

Resolved by https://github.com/apache/spark/pull/9298

> Remove Third-Party Hadoop Distributions Doc Page
> 
>
> Key: SPARK-11305
> URL: https://issues.apache.org/jira/browse/SPARK-11305
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Patrick Wendell
>Assignee: Sean Owen
>Priority: Critical
> Fix For: 1.6.0
>
>
> There is a fairly old page in our docs that contains a bunch of assorted 
> information regarding running Spark on Hadoop clusters. I think this page 
> should be removed and merged into other parts of the docs because the 
> information is largely redundant and somewhat outdated.
> http://spark.apache.org/docs/latest/hadoop-third-party-distributions.html
> There are four sections:
> 1. Compile-time Hadoop version - this information I think can be removed in 
> favor of that on the "Building Spark" page. These days most "advanced users" 
> are building without bundling Hadoop, so I'm not sure giving them a bunch of 
> different Hadoop versions sends the right message.
> 2. Linking against Hadoop - this doesn't seem to add much beyond what is in 
> the programming guide.
> 3. Where to run Spark - redundant with the hardware provisioning guide.
> 4. Inheriting cluster configurations - I think this would be better as a 
> section at the end of the configuration page.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11236) Upgrade Tachyon dependency to 0.8.0

2015-11-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14984327#comment-14984327
 ] 

Apache Spark commented on SPARK-11236:
--

User 'calvinjia' has created a pull request for this issue:
https://github.com/apache/spark/pull/9395

> Upgrade Tachyon dependency to 0.8.0
> ---
>
> Key: SPARK-11236
> URL: https://issues.apache.org/jira/browse/SPARK-11236
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.5.1
>Reporter: Calvin Jia
>
> Update the tachyon-client dependency from 0.7.1 to 0.8.0. There are no new 
> dependencies added and no Spark-facing APIs changed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11191) [1.5] Can't create UDF's using hive thrift service

2015-11-01 Thread Study Hsueh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14984377#comment-14984377
 ] 

Study Hsueh commented on SPARK-11191:
-

This should be caused by the builtin FunctionRegistry in 1.5.1:
https://github.com/apache/spark/blob/v1.5.1/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala#L413

For comparison, the FunctionRegistry in 1.4.1:
https://github.com/apache/spark/blob/v1.4.1/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala#L377
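A toy sketch of the suspected difference (illustrative only; the real registries are Catalyst classes, not sets):

{code}
// Illustrative only, not Spark's actual code: a lookup backed solely by a
// builtin registry never sees functions registered through Hive's
// CREATE TEMPORARY FUNCTION, so resolution fails with "undefined function".
val builtin = Set("upper", "lower", "concat")
val hiveTemporary = Set("testudf")   // registered via CREATE TEMPORARY FUNCTION

def lookupWithHiveFallback(name: String): Boolean =
  builtin(name.toLowerCase) || hiveTemporary(name.toLowerCase)  // 1.4-style: found

def lookupBuiltinOnly(name: String): Boolean =
  builtin(name.toLowerCase)                                     // fails for testUDF
{code}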

> [1.5] Can't create UDF's using hive thrift service
> --
>
> Key: SPARK-11191
> URL: https://issues.apache.org/jira/browse/SPARK-11191
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
>Reporter: David Ross
>
> Since upgrading to Spark 1.5 we've been unable to create and use UDFs when 
> we run in thrift server mode.
> Our setup:
> We start the thrift server running against YARN in client mode (we've also 
> built our own Spark from the GitHub branch-1.5 with the following args: 
> {{-Pyarn -Phive -Phive-thriftserver}}).
> If I run the following after connecting via JDBC (in this case via beeline):
> {{add jar 'hdfs://path/to/jar'}}
> (this command succeeds with no errors)
> {{CREATE TEMPORARY FUNCTION testUDF AS 'com.foo.class.UDF';}}
> (this command succeeds with no errors)
> {{select testUDF(col1) from table1;}}
> I get the following error in the logs:
> {code}
> org.apache.spark.sql.AnalysisException: undefined function testUDF; line 1 
> pos 8
> at 
> org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2$$anonfun$1.apply(hiveUDFs.scala:58)
> at 
> org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2$$anonfun$1.apply(hiveUDFs.scala:58)
> at scala.Option.getOrElse(Option.scala:120)
> at 
> org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2.apply(hiveUDFs.scala:57)
> at 
> org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2.apply(hiveUDFs.scala:53)
> at scala.util.Try.getOrElse(Try.scala:77)
> at 
> org.apache.spark.sql.hive.HiveFunctionRegistry.lookupFunction(hiveUDFs.scala:53)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5$$anonfun$applyOrElse$24.apply(Analyzer.scala:506)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5$$anonfun$applyOrElse$24.apply(Analyzer.scala:506)
> at 
> org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:505)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:502)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:226)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249)
> {code}
> (cutting the bulk for ease of report, more than happy to send the full output)
> {code}
> 15/10/12 14:34:37 ERROR SparkExecuteStatementOperation: Error running hive 
> query:
> org.apache.hive.service.cli.HiveSQLException: 
> org.apache.spark.sql.AnalysisException: undefined function testUDF; line 1 
> pos 100
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.runInternal(SparkExecuteStatementOperation.scala:259)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:171)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:182)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> 

[jira] [Assigned] (SPARK-11112) DAG visualization: display RDD callsite

2015-11-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11112:


Assignee: Apache Spark  (was: Andrew Or)

> DAG visualization: display RDD callsite
> ---
>
> Key: SPARK-11112
> URL: https://issues.apache.org/jira/browse/SPARK-11112
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 1.4.0
>Reporter: Andrew Or
>Assignee: Apache Spark
>Priority: Critical
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11112) DAG visualization: display RDD callsite

2015-11-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14984468#comment-14984468
 ] 

Apache Spark commented on SPARK-11112:
--

User 'andrewor14' has created a pull request for this issue:
https://github.com/apache/spark/pull/9398

> DAG visualization: display RDD callsite
> ---
>
> Key: SPARK-11112
> URL: https://issues.apache.org/jira/browse/SPARK-11112
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 1.4.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Critical
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11191) [1.5] Can't create UDF's using hive thrift service

2015-11-01 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-11191:
-
Target Version/s: 1.5.2, 1.5.3, 1.6.0

> [1.5] Can't create UDF's using hive thrift service
> --
>
> Key: SPARK-11191
> URL: https://issues.apache.org/jira/browse/SPARK-11191
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
>Reporter: David Ross
>
> Since upgrading to Spark 1.5 we've been unable to create and use UDFs when 
> we run in thrift server mode.
> Our setup:
> We start the thrift server running against YARN in client mode (we've also 
> built our own Spark from the GitHub branch-1.5 with the following args: 
> {{-Pyarn -Phive -Phive-thriftserver}}).
> If I run the following after connecting via JDBC (in this case via beeline):
> {{add jar 'hdfs://path/to/jar'}}
> (this command succeeds with no errors)
> {{CREATE TEMPORARY FUNCTION testUDF AS 'com.foo.class.UDF';}}
> (this command succeeds with no errors)
> {{select testUDF(col1) from table1;}}
> I get the following error in the logs:
> {code}
> org.apache.spark.sql.AnalysisException: undefined function testUDF; line 1 
> pos 8
> at 
> org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2$$anonfun$1.apply(hiveUDFs.scala:58)
> at 
> org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2$$anonfun$1.apply(hiveUDFs.scala:58)
> at scala.Option.getOrElse(Option.scala:120)
> at 
> org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2.apply(hiveUDFs.scala:57)
> at 
> org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2.apply(hiveUDFs.scala:53)
> at scala.util.Try.getOrElse(Try.scala:77)
> at 
> org.apache.spark.sql.hive.HiveFunctionRegistry.lookupFunction(hiveUDFs.scala:53)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5$$anonfun$applyOrElse$24.apply(Analyzer.scala:506)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5$$anonfun$applyOrElse$24.apply(Analyzer.scala:506)
> at 
> org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:505)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:502)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:226)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249)
> {code}
> (cutting the bulk for ease of report, more than happy to send the full output)
> {code}
> 15/10/12 14:34:37 ERROR SparkExecuteStatementOperation: Error running hive 
> query:
> org.apache.hive.service.cli.HiveSQLException: 
> org.apache.spark.sql.AnalysisException: undefined function testUDF; line 1 
> pos 100
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.runInternal(SparkExecuteStatementOperation.scala:259)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:171)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:182)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> When I ran the same against 1.4 it worked.
> I've also changed the {{spark.sql.hive.metastore.version}} 

[jira] [Updated] (SPARK-11191) [1.5] Can't create UDF's using hive thrift service

2015-11-01 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-11191:
-
Priority: Critical  (was: Major)

> [1.5] Can't create UDF's using hive thrift service
> --
>
> Key: SPARK-11191
> URL: https://issues.apache.org/jira/browse/SPARK-11191
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
>Reporter: David Ross
>Priority: Critical
>
> Since upgrading to Spark 1.5 we've been unable to create and use UDFs when 
> we run in thrift server mode.
> Our setup:
> We start the thrift server running against YARN in client mode (we've also 
> built our own Spark from the GitHub branch-1.5 with the following args: 
> {{-Pyarn -Phive -Phive-thriftserver}}).
> If I run the following after connecting via JDBC (in this case via beeline):
> {{add jar 'hdfs://path/to/jar'}}
> (this command succeeds with no errors)
> {{CREATE TEMPORARY FUNCTION testUDF AS 'com.foo.class.UDF';}}
> (this command succeeds with no errors)
> {{select testUDF(col1) from table1;}}
> I get the following error in the logs:
> {code}
> org.apache.spark.sql.AnalysisException: undefined function testUDF; line 1 
> pos 8
> at 
> org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2$$anonfun$1.apply(hiveUDFs.scala:58)
> at 
> org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2$$anonfun$1.apply(hiveUDFs.scala:58)
> at scala.Option.getOrElse(Option.scala:120)
> at 
> org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2.apply(hiveUDFs.scala:57)
> at 
> org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2.apply(hiveUDFs.scala:53)
> at scala.util.Try.getOrElse(Try.scala:77)
> at 
> org.apache.spark.sql.hive.HiveFunctionRegistry.lookupFunction(hiveUDFs.scala:53)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5$$anonfun$applyOrElse$24.apply(Analyzer.scala:506)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5$$anonfun$applyOrElse$24.apply(Analyzer.scala:506)
> at 
> org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:505)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:502)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:226)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249)
> {code}
> (cutting the bulk for ease of report, more than happy to send the full output)
> {code}
> 15/10/12 14:34:37 ERROR SparkExecuteStatementOperation: Error running hive 
> query:
> org.apache.hive.service.cli.HiveSQLException: 
> org.apache.spark.sql.AnalysisException: undefined function testUDF; line 1 
> pos 100
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.runInternal(SparkExecuteStatementOperation.scala:259)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:171)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:182)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> When I ran the same against 1.4 it worked.
> I've also changed the 

[jira] [Assigned] (SPARK-11112) DAG visualization: display RDD callsite

2015-11-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11112:


Assignee: Andrew Or  (was: Apache Spark)

> DAG visualization: display RDD callsite
> ---
>
> Key: SPARK-11112
> URL: https://issues.apache.org/jira/browse/SPARK-11112
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 1.4.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Critical
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11440) Declare rest of @Experimental items non-experimental if they've existed since 1.2.0

2015-11-01 Thread Sean Owen (JIRA)
Sean Owen created SPARK-11440:
-

 Summary: Declare rest of @Experimental items non-experimental if 
they've existed since 1.2.0
 Key: SPARK-11440
 URL: https://issues.apache.org/jira/browse/SPARK-11440
 Project: Spark
  Issue Type: Improvement
  Components: Build, Spark Core, Streaming
Affects Versions: 1.5.1
Reporter: Sean Owen
Assignee: Sean Owen
Priority: Minor


Follow-on to SPARK-11184. This removes {{Experimental}} annotations on methods 
that have existed since at least 1.2.0. That's almost entirely stuff in core 
and streaming. 

SQL experimental items are largely from 1.3.0 onwards; arguably those could be 
made non-Experimental too, and I'm happy to do that.

We've already reviewed MLlib, and ML is, for the most part, still properly 
Experimental now. Details in the PR.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11401) PMML export for Logistic Regression Multiclass Classification

2015-11-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14984441#comment-14984441
 ] 

Apache Spark commented on SPARK-11401:
--

User 'selvinsource' has created a pull request for this issue:
https://github.com/apache/spark/pull/9397

> PMML export for Logistic Regression Multiclass Classification
> -
>
> Key: SPARK-11401
> URL: https://issues.apache.org/jira/browse/SPARK-11401
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Vincenzo Selvaggio
>Priority: Minor
>
> tvmanikandan requested on https://github.com/apache/spark/pull/3062 multiclass 
> support for logistic regression.
> At the moment the toPMML method for Logistic Regression only supports binary 
> classification.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11401) PMML export for Logistic Regression Multiclass Classification

2015-11-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11401:


Assignee: (was: Apache Spark)

> PMML export for Logistic Regression Multiclass Classification
> -
>
> Key: SPARK-11401
> URL: https://issues.apache.org/jira/browse/SPARK-11401
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Vincenzo Selvaggio
>Priority: Minor
>
> tvmanikandan requested on https://github.com/apache/spark/pull/3062 multiclass 
> support for logistic regression.
> At the moment the toPMML method for Logistic Regression only supports binary 
> classification.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11435) Stop SparkContext at the end of subtest in SparkListenerSuite

2015-11-01 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-11435.
---
Resolution: Not A Problem

> Stop SparkContext at the end of subtest in SparkListenerSuite
> -
>
> Key: SPARK-11435
> URL: https://issues.apache.org/jira/browse/SPARK-11435
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Reporter: Ted Yu
>Priority: Minor
>
> Some subtests in SparkListenerSuite create a SparkContext without stopping it 
> explicitly upon completion of the subtest.
> This issue is to stop the SparkContext explicitly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11441) HadoopFsRelation is not scalable in number of files read/written

2015-11-01 Thread koert kuipers (JIRA)
koert kuipers created SPARK-11441:
-

 Summary: HadoopFsRelation is not scalable in number of files 
read/written
 Key: SPARK-11441
 URL: https://issues.apache.org/jira/browse/SPARK-11441
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.1
Reporter: koert kuipers


HadoopFsRelation includes a fileStatusCache which holds information on all the 
data files (part files) for the data source in the driver program.

It is not unusual to be reading from 100k+ or even 1M part files, in which 
case filling up this cache will take a very long time (days?) and require a lot 
of memory. See for example:
https://mail-archives.apache.org/mod_mbox/incubator-spark-user/201510.mbox/%3CCAG+ckK-FvWK=1b2jqc4s+zaz2zvkqehvos9myo0ssmgpc5-...@mail.gmail.com%3E

This is not the kind of behavior you would expect of a driver program. Also 
HadoopFsRelation passes this large list of part files into:
def buildScan(inputFiles: Array[FileStatus]): RDD[Row]

Almost all implementations of HadoopFsRelation do the following inside 
buildScan:
FileInputFormat.setInputPaths(job, inputFiles.map(_.getPath): _*)
This means an array of potentially millions of items now gets stored in the 
JobConf, which will be broadcast. I have not found any errors about this on the 
mailing list, but I believe this is simply because nobody with a large number 
of inputFiles has gotten this far.

Generally, when using Hadoop InputFormats there should never be a need to list 
all the part files on the driver side. It seems the reason it is done here is 
to facilitate a driver-side process in ParquetRelation that creates a merged 
data schema. I wonder if it's really necessary to look at all the part files 
for this, or if some assumption can be made that at least all the part files in 
a directory have the same schema (which would reduce the size of the problem by 
a factor of 100 or so).

At the very least, it seems to me that the caching of files is Parquet-specific 
and does not belong in HadoopFsRelation. And buildScan should just use the data 
paths (so directories, if one wants to read all part files in a directory) as 
it did before SPARK-7673 / PR #6225.

I ran into this issue myself with spark-avro, which also does not handle the 
input of part files in buildScan well. Spark-avro actually tries to create an 
RDD (and JobConf broadcast) per part file, which is not scalable even for 1k 
part files. Note that it is difficult for spark-avro to create an RDD per data 
directory (as it probably should) since the dataPaths have been lost now that 
inputFiles is passed into buildScan instead. This again suggests to me that the 
change in buildScan is troubling.
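To make the concern concrete, here is a minimal sketch (the helper name is illustrative, not Spark's actual code) of setting job input paths per data directory rather than per part file, which keeps the driver-side JobConf small:

{code}
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat

// One input path per data directory; Hadoop expands the part files inside
// each directory when it computes splits, so the driver never enumerates them.
def setInputByDirectory(job: Job, dataPaths: Seq[String]): Unit = {
  FileInputFormat.setInputPaths(job, dataPaths.map(new Path(_)): _*)
}
{code}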





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11435) Stop SparkContext at the end of subtest in SparkListenerSuite

2015-11-01 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14984422#comment-14984422
 ] 

Ted Yu commented on SPARK-11435:


LocalSparkContext would close the SparkContext.

> Stop SparkContext at the end of subtest in SparkListenerSuite
> -
>
> Key: SPARK-11435
> URL: https://issues.apache.org/jira/browse/SPARK-11435
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Reporter: Ted Yu
>Priority: Minor
>
> Some subtests in SparkListenerSuite create a SparkContext without stopping it 
> explicitly upon completion of the subtest.
> This issue is to stop the SparkContext explicitly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11440) Declare rest of @Experimental items non-experimental if they've existed since 1.2.0

2015-11-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11440:


Assignee: Sean Owen  (was: Apache Spark)

> Declare rest of @Experimental items non-experimental if they've existed since 
> 1.2.0
> ---
>
> Key: SPARK-11440
> URL: https://issues.apache.org/jira/browse/SPARK-11440
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Spark Core, Streaming
>Affects Versions: 1.5.1
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
>
> Follow-on to SPARK-11184. This removes {{Experimental}} annotations on 
> methods that have existed since at least 1.2.0. That's almost entirely stuff 
> in core and streaming. 
> SQL experimental items are largely from 1.3.0 onwards; arguably those could 
> be made non-Experimental too, and I'm happy to do that.
> We've already reviewed MLlib, and ML is, for the most part, still properly 
> Experimental now. Details in the PR.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11440) Declare rest of @Experimental items non-experimental if they've existed since 1.2.0

2015-11-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11440:


Assignee: Apache Spark  (was: Sean Owen)

> Declare rest of @Experimental items non-experimental if they've existed since 
> 1.2.0
> ---
>
> Key: SPARK-11440
> URL: https://issues.apache.org/jira/browse/SPARK-11440
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Spark Core, Streaming
>Affects Versions: 1.5.1
>Reporter: Sean Owen
>Assignee: Apache Spark
>Priority: Minor
>
> Follow-on to SPARK-11184. This removes {{Experimental}} annotations on 
> methods that have existed since at least 1.2.0. That's almost entirely stuff 
> in core and streaming. 
> SQL experimental items are largely from 1.3.0 onwards; arguably those could 
> be made non-Experimental too, and I'm happy to do that.
> We've already reviewed MLlib, and ML is, for the most part, still properly 
> Experimental now. Details in the PR.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11440) Declare rest of @Experimental items non-experimental if they've existed since 1.2.0

2015-11-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14984426#comment-14984426
 ] 

Apache Spark commented on SPARK-11440:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/9396

> Declare rest of @Experimental items non-experimental if they've existed since 
> 1.2.0
> ---
>
> Key: SPARK-11440
> URL: https://issues.apache.org/jira/browse/SPARK-11440
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Spark Core, Streaming
>Affects Versions: 1.5.1
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
>
> Follow-on to SPARK-11184. This removes {{Experimental}} annotations on 
> methods that have existed since at least 1.2.0. That's almost entirely stuff 
> in core and streaming. 
> SQL experimental items are largely from 1.3.0 onwards; arguably those could 
> be made non-Experimental too, and I'm happy to do that.
> We've already reviewed MLlib, and ML is, for the most part, still properly 
> Experimental now. Details in the PR.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11338) HistoryPage not multi-tenancy enabled (app links not prefixed with APPLICATION_WEB_PROXY_BASE)

2015-11-01 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-11338.

   Resolution: Fixed
 Assignee: Christian Kadner
Fix Version/s: 1.6.0

> HistoryPage not multi-tenancy enabled (app links not prefixed with 
> APPLICATION_WEB_PROXY_BASE)
> --
>
> Key: SPARK-11338
> URL: https://issues.apache.org/jira/browse/SPARK-11338
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Reporter: Christian Kadner
>Assignee: Christian Kadner
> Fix For: 1.6.0
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Links on {{HistoryPage}} are not prepended with {{uiRoot}} ({{export 
> APPLICATION_WEB_PROXY_BASE=}}). This makes it 
> impossible/impractical to expose the *History Server* in a multi-tenancy 
> environment where each Spark service instance has one history server behind a 
> multi-tenant enabled proxy server.  All other Spark web UI pages are 
> correctly prefixed when the {{APPLICATION_WEB_PROXY_BASE}} environment 
> variable is set.
> *Repro steps:*\\
> # Configure history log collection:
> {code:title=conf/spark-defaults.conf|borderStyle=solid}
> spark.eventLog.enabled true
> spark.eventLog.dir logs/history
> spark.history.fs.logDirectory  logs/history
> {code}
> ...create the logs folders:
> {code}
> $ mkdir -p logs/history
> {code}
> # Start the Spark shell and run the word count example:
> {code:java|borderStyle=solid}
> $ bin/spark-shell
> ...
> scala> sc.textFile("README.md").flatMap(_.split(" ")).map(w => (w, 
> 1)).reduceByKey(_ + _).collect
> scala> sc.stop
> {code}
> # Set the web proxy root path (i.e. {{/testwebuiproxy/..}}):
> {code}
> $ export APPLICATION_WEB_PROXY_BASE=/testwebuiproxy/..
> {code}
> # Start the history server:
> {code}
> $  sbin/start-history-server.sh
> {code}
> # Bring up the History Server web UI at {{localhost:18080}} and view the 
> application link in the HTML source text:
> {code:xml|borderColor=#c00}
> ...
> <th>App ID</th><th>App Name</th>...
>   <td>
>     <a href="/history/local-1445896187531">local-1445896187531</a>
>   </td>
>   <td>Spark shell</td>
>   ...
> {code}
> *Notice*, application link "{{/history/local-1445896187531}}" does _not_ have 
> the prefix {{/testwebuiproxy/..}} \\ \\
> All site-relative links (URLs starting with {{"/"}}) should have been 
> prepended with the uiRoot prefix {{/testwebuiproxy/..}} like this ...
> {code:xml|borderColor=#0c0}
> ...
> <th>App ID</th><th>App Name</th>...
>   <td>
>     <a href="/testwebuiproxy/../history/local-1445896187531">local-1445896187531</a>
>   </td>
>   <td>Spark shell</td>
>   ...
> {code}
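A minimal sketch of the expected behavior (the helper name is illustrative; Spark's real UI code differs): site-relative links get the proxy base prepended before rendering.

{code}
// Illustrative helper: prepend the proxy base (uiRoot) to site-relative links.
def prependBaseUri(uiRoot: String, path: String): String =
  if (path.startsWith("/")) uiRoot + path else path

prependBaseUri("/testwebuiproxy/..", "/history/local-1445896187531")
// => "/testwebuiproxy/../history/local-1445896187531"
{code}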



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11444) Allow batch seqOp combination in treeReduce

2015-11-01 Thread holdenk (JIRA)
holdenk created SPARK-11444:
---

 Summary: Allow batch seqOp combination in treeReduce
 Key: SPARK-11444
 URL: https://issues.apache.org/jira/browse/SPARK-11444
 Project: Spark
  Issue Type: Improvement
  Components: ML, Spark Core
Reporter: holdenk
Priority: Minor


Allow a batched seqOp in treeReduce to enable better integration with GPU-type 
workloads.
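A sketch of the idea (the API shape here is hypothetical, not an existing Spark method): feed each partition's elements to the combiner in fixed-size batches so a single call can be handed to a GPU kernel, then combine the partial results with treeReduce.

{code}
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// Hypothetical helper: apply batchOp once per batch of elements (GPU-friendly)
// instead of folding element by element, then tree-combine the partial results.
def batchedTreeReduce[T: ClassTag](rdd: RDD[T], batchSize: Int)
                                  (batchOp: Seq[T] => T, combOp: (T, T) => T): T = {
  rdd.mapPartitions { iter =>
    iter.grouped(batchSize).map(batch => batchOp(batch))
  }.treeReduce(combOp)
}
{code}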



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10929) Tungsten fails to acquire memory writing to HDFS

2015-11-01 Thread Naden Franciscus (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14984615#comment-14984615
 ] 

Naden Franciscus edited comment on SPARK-10929 at 11/2/15 12:48 AM:


Can confirm this issue has been resolved.

Any reason we can't target this for 1.5.3 ? 

It is a serious bug and it would mean that any HDP distros that eventually 
deploy Spark 1.5 would be unusable for many of us.


was (Author: nadenf):
Can confirm this issue has been resolved.

Any reason we can't target this for 1.5.3 ? 

It is a serious bug and it would mean that any HDP distros that use Spark 1.5 
would be unusable for many of us.

> Tungsten fails to acquire memory writing to HDFS
> 
>
> Key: SPARK-10929
> URL: https://issues.apache.org/jira/browse/SPARK-10929
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Naden Franciscus
>Assignee: Davies Liu
>Priority: Blocker
> Fix For: 1.6.0
>
>
> We are executing 20 Spark SQL jobs in parallel using Spark Job Server and 
> hitting the following issue pretty routinely.
> 40GB heap x 6 nodes. Have tried adjusting shuffle.memoryFraction from 0.2 -> 
> 0.1 with no difference. 
> {code}
> .16): org.apache.spark.SparkException: Task failed while writing rows.
> at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:250)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Unable to acquire 16777216 bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:351)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.<init>(UnsafeExternalSorter.java:138)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.create(UnsafeExternalSorter.java:106)
> at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.<init>(UnsafeExternalRowSorter.java:68)
> at 
> org.apache.spark.sql.execution.TungstenSort.org$apache$spark$sql$execution$TungstenSort$$preparePartition$1(sort.scala:146)
> at 
> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169)
> at 
> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.prepare(MapPartitionsWithPreparationRDD.scala:50)
> at 
> org.apache.spark.rdd.ZippedPartitionsBaseRDD$$anonfun$tryPrepareParents$1.applyOrElse(ZippedPartitionsRDD.scala:83)
> at 
> org.apache.spark.rdd.ZippedPartitionsBaseRDD$$anonfun$tryPrepareParents$1.applyOrElse(ZippedPartitionsRDD.scala:82)
> at 
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
> at 
> scala.collection.TraversableLike$$anonfun$collect$1.apply(TraversableLike.scala:278)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> scala.collection.TraversableLike$class.collect(TraversableLike.scala:278)
> at scala.collection.AbstractTraversable.collect(Traversable.scala:105)
> at 
> org.apache.spark.rdd.ZippedPartitionsBaseRDD.tryPrepareParents(ZippedPartitionsRDD.scala:82)
> at 
> org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:97)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> {code}
> I have tried setting spark.buffer.pageSize to both 1MB and 64MB (in 
> spark-defaults.conf) and it makes no difference.
> It also tries to acquire 33554432 bytes of memory in both cases.
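For reference, the two knobs mentioned in the report correspond to configuration along these lines (values are those the reporter tried; this is illustrative context, not a recommended fix):

{code}
// Illustrative SparkConf equivalent of the settings tried in the report;
// the same keys can also be set in conf/spark-defaults.conf.
val conf = new org.apache.spark.SparkConf()
  .set("spark.shuffle.memoryFraction", "0.1")   // tried 0.2 -> 0.1
  .set("spark.buffer.pageSize", "64m")          // tried 1MB and 64MB
{code}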



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10929) Tungsten fails to acquire memory writing to HDFS

2015-11-01 Thread Naden Franciscus (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14984615#comment-14984615
 ] 

Naden Franciscus commented on SPARK-10929:
--

Can confirm this issue has been resolved.

Any reason we can't target this for 1.5.3 ? 

It is a serious bug and it would mean that any HDP distros that use Spark 1.5 
would be unusable for many of us.

> Tungsten fails to acquire memory writing to HDFS
> 
>
> Key: SPARK-10929
> URL: https://issues.apache.org/jira/browse/SPARK-10929
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Naden Franciscus
>Assignee: Davies Liu
>Priority: Blocker
> Fix For: 1.6.0
>
>
> We are executing 20 Spark SQL jobs in parallel using Spark Job Server and 
> hitting the following issue pretty routinely.
> 40GB heap x 6 nodes. Have tried adjusting shuffle.memoryFraction from 0.2 -> 
> 0.1 with no difference. 
> {code}
> .16): org.apache.spark.SparkException: Task failed while writing rows.
> at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:250)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Unable to acquire 16777216 bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:351)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.<init>(UnsafeExternalSorter.java:138)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.create(UnsafeExternalSorter.java:106)
> at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.<init>(UnsafeExternalRowSorter.java:68)
> at 
> org.apache.spark.sql.execution.TungstenSort.org$apache$spark$sql$execution$TungstenSort$$preparePartition$1(sort.scala:146)
> at 
> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169)
> at 
> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.prepare(MapPartitionsWithPreparationRDD.scala:50)
> at 
> org.apache.spark.rdd.ZippedPartitionsBaseRDD$$anonfun$tryPrepareParents$1.applyOrElse(ZippedPartitionsRDD.scala:83)
> at 
> org.apache.spark.rdd.ZippedPartitionsBaseRDD$$anonfun$tryPrepareParents$1.applyOrElse(ZippedPartitionsRDD.scala:82)
> at 
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
> at 
> scala.collection.TraversableLike$$anonfun$collect$1.apply(TraversableLike.scala:278)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> scala.collection.TraversableLike$class.collect(TraversableLike.scala:278)
> at scala.collection.AbstractTraversable.collect(Traversable.scala:105)
> at 
> org.apache.spark.rdd.ZippedPartitionsBaseRDD.tryPrepareParents(ZippedPartitionsRDD.scala:82)
> at 
> org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:97)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> {code}
> I have tried setting spark.buffer.pageSize to both 1MB and 64MB (in 
> spark-defaults.conf) and it makes no difference.
> It also tries to acquire 33554432 bytes of memory in both cases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11445) Replace example code in mllib-ensembles.md using include_example

2015-11-01 Thread Gabor Liptak (JIRA)
Gabor Liptak created SPARK-11445:


 Summary: Replace example code in mllib-ensembles.md using 
include_example
 Key: SPARK-11445
 URL: https://issues.apache.org/jira/browse/SPARK-11445
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation
Reporter: Gabor Liptak






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11131) Worker registration protocol is racy

2015-11-01 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-11131.

   Resolution: Fixed
 Assignee: Marcelo Vanzin
Fix Version/s: 1.6.0

> Worker registration protocol is racy
> 
>
> Key: SPARK-11131
> URL: https://issues.apache.org/jira/browse/SPARK-11131
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Minor
> Fix For: 1.6.0
>
>
> I ran into this while making changes to the new RPC framework. Because the 
> Worker registration protocol is based on sending unrelated messages between 
> Master and Worker, it's possible for another message (e.g. caused by an 
> app trying to allocate workers) to arrive at the Worker before it knows the 
> Master has registered it. This triggers the following code:
> {code}
> case LaunchExecutor(masterUrl, appId, execId, appDesc, cores_, memory_) =>
>   if (masterUrl != activeMasterUrl) {
> logWarning("Invalid Master (" + masterUrl + ") attempted to launch 
> executor.")
> {code}
> This may or may not be made worse by SPARK-11098.
> A simple workaround is to use an {{ask}} instead of a {{send}} for these 
> messages. That should at least narrow the race. 
> Note this is more of a problem in {{local-cluster}} mode, used a lot by unit 
> tests, where Master and Worker instances are coming up as part of the app 
> itself.
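A toy model of the race described above (message and field names follow the quoted snippet, but the code is illustrative only, not Spark's):

{code}
// Toy model: with fire-and-forget registration, a LaunchExecutor can be
// observed before the Worker learns it has been registered.
sealed trait Msg
case object RegisteredWorker extends Msg
case class LaunchExecutor(masterUrl: String) extends Msg

var activeMasterUrl: String = ""   // only set once registration is confirmed

def receive(msg: Msg): Unit = msg match {
  case RegisteredWorker => activeMasterUrl = "spark://master:7077"
  case LaunchExecutor(masterUrl) =>
    if (masterUrl != activeMasterUrl)
      println(s"Invalid Master ($masterUrl) attempted to launch executor.")
}

// Out-of-order delivery triggers the spurious warning:
receive(LaunchExecutor("spark://master:7077"))   // arrives before registration
receive(RegisteredWorker)
{code}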



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11382) Replace example code in mllib-decision-tree.md using include_example

2015-11-01 Thread Gabor Liptak (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Liptak updated SPARK-11382:
-
Summary: Replace example code in mllib-decision-tree.md using 
include_example  (was: Replace example code in 
mllib-decision-tree.md/mllib-ensembles.md using include_example)

> Replace example code in mllib-decision-tree.md using include_example
> 
>
> Key: SPARK-11382
> URL: https://issues.apache.org/jira/browse/SPARK-11382
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Xusen Yin
>  Labels: starter
>
> This is similar to SPARK-11289 but for the example code in 
> mllib-frequent-pattern-mining.md.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11446) Spark 1.6 release notes

2015-11-01 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-11446:

Target Version/s: 1.6.0

> Spark 1.6 release notes
> ---
>
> Key: SPARK-11446
> URL: https://issues.apache.org/jira/browse/SPARK-11446
> Project: Spark
>  Issue Type: Task
>  Components: Documentation
>Reporter: Patrick Wendell
>Assignee: Michael Armbrust
>Priority: Critical
>
> This is a staging location where we can keep track of changes that need to be 
> documented in the release notes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11446) Spark 1.6 release notes

2015-11-01 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-11446:
---

 Summary: Spark 1.6 release notes
 Key: SPARK-11446
 URL: https://issues.apache.org/jira/browse/SPARK-11446
 Project: Spark
  Issue Type: Task
  Components: Documentation
Reporter: Patrick Wendell
Assignee: Michael Armbrust
Priority: Critical


This is a staging location where we can keep track of changes that need to be 
documented in the release notes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11238) SparkR: Documentation change for merge function

2015-11-01 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14984646#comment-14984646
 ] 

Patrick Wendell commented on SPARK-11238:
-

I created SPARK-11446 and linked it here.

> SparkR: Documentation change for merge function
> ---
>
> Key: SPARK-11238
> URL: https://issues.apache.org/jira/browse/SPARK-11238
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Narine Kokhlikyan
>  Labels: releasenotes
>
> As discussed in pull request https://github.com/apache/spark/pull/9012, the 
> signature of the merge function will be changed; therefore, a documentation 
> change is required.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11198) Support record de-aggregation in KinesisReceiver

2015-11-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11198:


Assignee: Burak Yavuz  (was: Apache Spark)

> Support record de-aggregation in KinesisReceiver
> 
>
> Key: SPARK-11198
> URL: https://issues.apache.org/jira/browse/SPARK-11198
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Xiangrui Meng
>Assignee: Burak Yavuz
>
> We need to check/implement the support for record de-aggregation and 
> subsequence number. This is the documentation from AWS:
> http://docs.aws.amazon.com/kinesis/latest/dev/kinesis-kpl-consumer-deaggregation.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11198) Support record de-aggregation in KinesisReceiver

2015-11-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14984625#comment-14984625
 ] 

Apache Spark commented on SPARK-11198:
--

User 'brkyvz' has created a pull request for this issue:
https://github.com/apache/spark/pull/9403

> Support record de-aggregation in KinesisReceiver
> 
>
> Key: SPARK-11198
> URL: https://issues.apache.org/jira/browse/SPARK-11198
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Xiangrui Meng
>Assignee: Burak Yavuz
>
> We need to check/implement the support for record de-aggregation and 
> subsequence number. This is the documentation from AWS:
> http://docs.aws.amazon.com/kinesis/latest/dev/kinesis-kpl-consumer-deaggregation.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11198) Support record de-aggregation in KinesisReceiver

2015-11-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11198:


Assignee: Apache Spark  (was: Burak Yavuz)

> Support record de-aggregation in KinesisReceiver
> 
>
> Key: SPARK-11198
> URL: https://issues.apache.org/jira/browse/SPARK-11198
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Xiangrui Meng
>Assignee: Apache Spark
>
> We need to check/implement the support for record de-aggregation and 
> subsequence number. This is the documentation from AWS:
> http://docs.aws.amazon.com/kinesis/latest/dev/kinesis-kpl-consumer-deaggregation.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9298) corr aggregate functions

2015-11-01 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-9298.
-
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8587
[https://github.com/apache/spark/pull/8587]

> corr aggregate functions
> 
>
> Key: SPARK-9298
> URL: https://issues.apache.org/jira/browse/SPARK-9298
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Liang-Chi Hsieh
> Fix For: 1.6.0
>
>
> A short introduction on how to build aggregate functions based on our new 
> interface can be found at 
> https://issues.apache.org/jira/browse/SPARK-4366?focusedCommentId=14639921=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14639921.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11447) Null comparison requires type information but type extraction fails for complex types

2015-11-01 Thread Kapil Singh (JIRA)
Kapil Singh created SPARK-11447:
---

 Summary: Null comparison requires type information but type 
extraction fails for complex types
 Key: SPARK-11447
 URL: https://issues.apache.org/jira/browse/SPARK-11447
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.1
Reporter: Kapil Singh


When comparing a Column to a null literal, the comparison works only if the type 
of the null literal matches the type of the Column it's being compared to. 
Example Scala code (can be run from the Spark shell):


import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.sql.catalyst.expressions._

val inputRowsData = Seq(Seq("abc"),Seq(null),Seq("xyz"))
val inputRows = for(seq <- inputRowsData) yield Row.fromSeq(seq)
val dfSchema = StructType(Seq(StructField("column", StringType, true)))
val df = sqlContext.createDataFrame(sc.makeRDD(inputRows), dfSchema)

//DOESN'T WORK
val filteredDF = df.filter(df("column") <=> (new Column(Literal(null))))

//WORKS
val filteredDF = df.filter(df("column") <=> (new Column(Literal.create(null, SparkleFunctions.dataType(df("column"))))))
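As a hedged side note (not part of the original report, and continuing the snippet above): the column's type can also be read from the DataFrame schema, which avoids the custom SparkleFunctions helper:

{code}
// Works as well: take the data type from the DataFrame's own schema.
val columnType = df.schema("column").dataType
val filteredDF2 = df.filter(df("column") <=> new Column(Literal.create(null, columnType)))
{code}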


Why should type information be required for a null comparison? Even if it is 
required, it's not always possible to extract type information from complex 
types (e.g. StructType). The following Scala code (can be run from the Spark 
shell) throws org.apache.spark.sql.catalyst.analysis.UnresolvedException:


import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.sql.catalyst.expressions._

val inputRowsData = Seq(Seq(Row.fromSeq(Seq("abc", "def"))), Seq(Row.fromSeq(Seq(null, "123"))), Seq(Row.fromSeq(Seq("ghi", "jkl"))))
val inputRows = for(seq <- inputRowsData) yield Row.fromSeq(seq)
val dfSchema = StructType(Seq(StructField("column", 
StructType(Seq(StructField("p1", StringType, true), StructField("p2", 
StringType, true))), true)))

val filteredDF = df.filter(df("column")("p1") <=> (new Column(Literal.create(null, SparkleFunctions.dataType(df("column")("p1"))))))

org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
dataType on unresolved object, tree: column#0[p1]
at 
org.apache.spark.sql.catalyst.analysis.UnresolvedExtractValue.dataType(unresolved.scala:243)
at 
org.apache.spark.sql.ArithmeticFunctions$class.dataType(ArithmeticFunctions.scala:76)
at 
org.apache.spark.sql.SparkleFunctions$.dataType(SparkleFunctions.scala:14)
at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:38)
at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:43)
at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:45)
at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:47)
at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:49)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:51)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:53)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:55)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:57)
at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:59)
at $iwC$$iwC$$iwC$$iwC.<init>(<console>:61)
at $iwC$$iwC$$iwC.<init>(<console>:63)
at $iwC$$iwC.<init>(<console>:65)
at $iwC.<init>(<console>:67)
at <init>(<console>:69)
at .(:73)
at .()
at .(:7)
at .()
at $print()
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
at 
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1340)
at 
org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
at 
org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
at 
org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
at 
org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at 

[jira] [Assigned] (SPARK-11443) Blank line reserved include_example

2015-11-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11443:


Assignee: (was: Apache Spark)

> Blank line reserved include_example
> ---
>
> Key: SPARK-11443
> URL: https://issues.apache.org/jira/browse/SPARK-11443
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Xusen Yin
>
> The trim_codeblock(lines) function in include_example.rb removes some blank 
> lines in the code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11020) HistoryServer fails to come up if HDFS takes too long to come out of safe mode

2015-11-01 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-11020.

   Resolution: Fixed
 Assignee: Marcelo Vanzin
Fix Version/s: 1.6.0

> HistoryServer fails to come up if HDFS takes too long to come out of safe mode
> --
>
> Key: SPARK-11020
> URL: https://issues.apache.org/jira/browse/SPARK-11020
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1, 1.6.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
> Fix For: 1.6.0
>
>
> When HDFS is starting up, it starts in safe mode until the NN is able to read 
> the whole fs image and initialize everything. For a really large NN that can 
> take a while.
> If the HS is started at the same time, it may give up trying to check whether 
> the event log directory exists, and exit. That's a little sub-optimal; the HS 
> could wait until HDFS came out of safe mode instead.
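
A minimal sketch of the waiting behavior described above (illustrative only, not the 
patch that was merged; it assumes the Hadoop DistributedFileSystem safe-mode query 
and a hypothetical polling interval):

import org.apache.hadoop.fs.Path
import org.apache.hadoop.hdfs.DistributedFileSystem
import org.apache.hadoop.hdfs.protocol.HdfsConstants.SafeModeAction

// Poll until the NameNode leaves safe mode, then check the event log directory,
// instead of failing fast while HDFS is still starting up.
def waitForEventLogDir(fs: DistributedFileSystem, logDir: Path, pollMs: Long = 5000L): Unit = {
  while (fs.setSafeMode(SafeModeAction.SAFEMODE_GET, true)) {
    Thread.sleep(pollMs)  // still in safe mode; retry later
  }
  require(fs.exists(logDir), s"Event log directory $logDir does not exist")
}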



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10387) Code generation for decision tree

2015-11-01 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14984601#comment-14984601
 ] 

holdenk commented on SPARK-10387:
-

I can do some work to help with this :)

> Code generation for decision tree
> -
>
> Key: SPARK-10387
> URL: https://issues.apache.org/jira/browse/SPARK-10387
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: DB Tsai
>
> Provide code generation for decision tree and tree ensembles. Let's first 
> discuss the design and then create new JIRAs for tasks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11191) [1.5] Can't create UDF's using hive thrift service

2015-11-01 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-11191:
-
Priority: Blocker  (was: Critical)

> [1.5] Can't create UDF's using hive thrift service
> --
>
> Key: SPARK-11191
> URL: https://issues.apache.org/jira/browse/SPARK-11191
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
>Reporter: David Ross
>Priority: Blocker
>
> Since upgrading to spark 1.5 we've been unable to create and use UDF's when 
> we run in thrift server mode.
> Our setup:
> We start the thrift-server running against yarn in client mode (we've also 
> built our own spark from github branch-1.5 with the following args: {{-Pyarn 
> -Phive -Phive-thriftserver}}).
> If I run the following after connecting via JDBC (in this case via beeline):
> {{add jar 'hdfs://path/to/jar'}}
> (this command succeeds with no errors)
> {{CREATE TEMPORARY FUNCTION testUDF AS 'com.foo.class.UDF';}}
> (this command succeeds with no errors)
> {{select testUDF(col1) from table1;}}
> I get the following error in the logs:
> {code}
> org.apache.spark.sql.AnalysisException: undefined function testUDF; line 1 
> pos 8
> at 
> org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2$$anonfun$1.apply(hiveUDFs.scala:58)
> at 
> org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2$$anonfun$1.apply(hiveUDFs.scala:58)
> at scala.Option.getOrElse(Option.scala:120)
> at 
> org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2.apply(hiveUDFs.scala:57)
> at 
> org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2.apply(hiveUDFs.scala:53)
> at scala.util.Try.getOrElse(Try.scala:77)
> at 
> org.apache.spark.sql.hive.HiveFunctionRegistry.lookupFunction(hiveUDFs.scala:53)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5$$anonfun$applyOrElse$24.apply(Analyzer.scala:506)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5$$anonfun$applyOrElse$24.apply(Analyzer.scala:506)
> at 
> org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:505)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:502)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:226)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249)
> {code}
> (cutting the bulk for ease of report, more than happy to send the full output)
> {code}
> 15/10/12 14:34:37 ERROR SparkExecuteStatementOperation: Error running hive 
> query:
> org.apache.hive.service.cli.HiveSQLException: 
> org.apache.spark.sql.AnalysisException: undefined function testUDF; line 1 
> pos 100
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.runInternal(SparkExecuteStatementOperation.scala:259)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:171)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:182)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> When I ran the same against 1.4 it worked.
> I've also changed the 

[jira] [Created] (SPARK-11443) Blank line reserved include_example

2015-11-01 Thread Xusen Yin (JIRA)
Xusen Yin created SPARK-11443:
-

 Summary: Blank line reserved include_example
 Key: SPARK-11443
 URL: https://issues.apache.org/jira/browse/SPARK-11443
 Project: Spark
  Issue Type: Sub-task
Reporter: Xusen Yin






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11407) Add documentation on using SparkR from RStudio

2015-11-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11407:


Assignee: Apache Spark

> Add documentation on using SparkR from RStudio
> --
>
> Key: SPARK-11407
> URL: https://issues.apache.org/jira/browse/SPARK-11407
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.1
>Reporter: Felix Cheung
>Assignee: Apache Spark
>Priority: Minor
>
> As per [~shivaram] we need to add a section in the programming guide on using 
> SparkR from RStudio, in which we should talk about:
> - how to load SparkR package
> - what configurable options for initializing SparkR



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11407) Add documentation on using SparkR from RStudio

2015-11-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11407:


Assignee: (was: Apache Spark)

> Add documentation on using SparkR from RStudio
> --
>
> Key: SPARK-11407
> URL: https://issues.apache.org/jira/browse/SPARK-11407
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.1
>Reporter: Felix Cheung
>Priority: Minor
>
> As per [~shivaram] we need to add a section in the programming guide on using 
> SparkR from RStudio, in which we should talk about:
> - how to load SparkR package
> - what configurable options for initializing SparkR



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11407) Add documentation on using SparkR from RStudio

2015-11-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14984575#comment-14984575
 ] 

Apache Spark commented on SPARK-11407:
--

User 'felixcheung' has created a pull request for this issue:
https://github.com/apache/spark/pull/9401

> Add documentation on using SparkR from RStudio
> --
>
> Key: SPARK-11407
> URL: https://issues.apache.org/jira/browse/SPARK-11407
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.1
>Reporter: Felix Cheung
>Priority: Minor
>
> As per [~shivaram] we need to add a section in the programming guide on using 
> SparkR from RStudio, in which we should talk about:
> - how to load SparkR package
> - what configurable options for initializing SparkR



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11410) Add a DataFrame API that provides functionality similar to HiveQL's DISTRIBUTE BY

2015-11-01 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-11410.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

It has been resolved by https://github.com/apache/spark/pull/9364.

> Add a DataFrame API that provides functionality similar to HiveQL's 
> DISTRIBUTE BY
> -
>
> Key: SPARK-11410
> URL: https://issues.apache.org/jira/browse/SPARK-11410
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Nong Li
>Assignee: Nong Li
> Fix For: 1.6.0
>
>
> DISTRIBUTE BY allows the user to control the partitioning and ordering of a 
> data set which can be very useful for some applications.
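
A minimal sketch of what the DataFrame-side equivalent looks like (assuming the 
repartition-by-expressions and sortWithinPartitions methods the linked pull request 
adds for 1.6; {{df}} and the column name are placeholders):

// DISTRIBUTE BY key: hash-partition rows by the given expression.
val distributed = df.repartition(df("key"))
// CLUSTER BY key: additionally sort rows within each partition.
val clustered = distributed.sortWithinPartitions("key")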



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11410) Add a DataFrame API that provides functionality similar to HiveQL's DISTRIBUTE BY

2015-11-01 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-11410:
-
Assignee: Nong Li

> Add a DataFrame API that provides functionality similar to HiveQL's 
> DISTRIBUTE BY
> -
>
> Key: SPARK-11410
> URL: https://issues.apache.org/jira/browse/SPARK-11410
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Nong Li
>Assignee: Nong Li
> Fix For: 1.6.0
>
>
> DISTRIBUTE BY allows the user to control the partitioning and ordering of a 
> data set which can be very useful for some applications.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11073) Remove akka dependency from SecurityManager

2015-11-01 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-11073.

   Resolution: Fixed
 Assignee: Marcelo Vanzin
Fix Version/s: 1.6.0

> Remove akka dependency from SecurityManager
> ---
>
> Key: SPARK-11073
> URL: https://issues.apache.org/jira/browse/SPARK-11073
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Minor
> Fix For: 1.6.0
>
>
> {{SecurityManager::generateSecretKey}} currently uses akka to generate a 
> secret for the app. We should remove that dependency.
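
A minimal sketch of an akka-free secret generator of the kind described (illustrative 
only, not the code that was merged; it assumes java.security.SecureRandom and a 
hypothetical key length):

import java.security.SecureRandom
import javax.xml.bind.DatatypeConverter

// Draw random bytes from SecureRandom and hex-encode them, with no akka involved.
def generateSecretKey(bits: Int = 256): String = {
  val bytes = new Array[Byte](bits / 8)
  new SecureRandom().nextBytes(bytes)
  DatatypeConverter.printHexBinary(bytes).toLowerCase
}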



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10978) Allow PrunedFilterScan to eliminate predicates from further evaluation

2015-11-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14984548#comment-14984548
 ] 

Apache Spark commented on SPARK-10978:
--

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/9399

> Allow PrunedFilterScan to eliminate predicates from further evaluation
> --
>
> Key: SPARK-10978
> URL: https://issues.apache.org/jira/browse/SPARK-10978
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.3.0, 1.4.0, 1.5.0
>Reporter: Russell Alexander Spitzer
>Priority: Critical
>
> Currently PrunedFilterScan allows implementors to push down predicates to an 
> underlying datasource. This is done solely as an optimization as the 
> predicate will be reapplied on the Spark side as well. This allows for 
> bloom-filter like operations but ends up doing a redundant scan for those 
> sources which can do accurate pushdowns.
> In addition it makes it difficult for underlying sources to accept queries 
> which reference non-existent columns in order to provide ancillary 
> functionality. In our case we allow a solr query to be passed in via a 
> non-existent solr_query column. Since this column is not returned, nothing 
> passes when Spark re-applies the filter on "solr_query". 
> Suggestion on the ML from [~marmbrus] 
> {quote}
> We have to try and maintain binary compatibility here, so probably the 
> easiest thing to do here would be to add a method to the class.  Perhaps 
> something like:
> def unhandledFilters(filters: Array[Filter]): Array[Filter] = filters
> By default, this could return all filters so behavior would remain the same, 
> but specific implementations could override it.  There is still a chance that 
> this would conflict with existing methods, but hopefully that would not be a 
> problem in practice.
> {quote}
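
A minimal sketch of the suggested hook, specialized to the solr_query example above 
(the column name and the EqualTo pattern are illustrative assumptions, not part of 
the proposal itself):

import org.apache.spark.sql.sources.{EqualTo, Filter, GreaterThan}

// Return only the filters this source could NOT evaluate itself; Spark would
// re-apply just those. Equality on "solr_query" is treated as fully handled.
def unhandledFilters(filters: Array[Filter]): Array[Filter] =
  filters.filterNot {
    case EqualTo("solr_query", _) => true
    case _ => false
  }

// Spark would keep only the GreaterThan filter for re-evaluation here:
val remaining = unhandledFilters(Array(EqualTo("solr_query", "type:book"), GreaterThan("price", 10)))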



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10978) Allow PrunedFilterScan to eliminate predicates from further evaluation

2015-11-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10978:


Assignee: Apache Spark

> Allow PrunedFilterScan to eliminate predicates from further evaluation
> --
>
> Key: SPARK-10978
> URL: https://issues.apache.org/jira/browse/SPARK-10978
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.3.0, 1.4.0, 1.5.0
>Reporter: Russell Alexander Spitzer
>Assignee: Apache Spark
>Priority: Critical
>
> Currently PrunedFilterScan allows implementors to push down predicates to an 
> underlying datasource. This is done solely as an optimization as the 
> predicate will be reapplied on the Spark side as well. This allows for 
> bloom-filter like operations but ends up doing a redundant scan for those 
> sources which can do accurate pushdowns.
> In addition it makes it difficult for underlying sources to accept queries 
> which reference non-existent columns in order to provide ancillary 
> functionality. In our case we allow a solr query to be passed in via a 
> non-existent solr_query column. Since this column is not returned, nothing 
> passes when Spark re-applies the filter on "solr_query". 
> Suggestion on the ML from [~marmbrus] 
> {quote}
> We have to try and maintain binary compatibility here, so probably the 
> easiest thing to do here would be to add a method to the class.  Perhaps 
> something like:
> def unhandledFilters(filters: Array[Filter]): Array[Filter] = filters
> By default, this could return all filters so behavior would remain the same, 
> but specific implementations could override it.  There is still a chance that 
> this would conflict with existing methods, but hopefully that would not be a 
> problem in practice.
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10978) Allow PrunedFilterScan to eliminate predicates from further evaluation

2015-11-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10978:


Assignee: (was: Apache Spark)

> Allow PrunedFilterScan to eliminate predicates from further evaluation
> --
>
> Key: SPARK-10978
> URL: https://issues.apache.org/jira/browse/SPARK-10978
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.3.0, 1.4.0, 1.5.0
>Reporter: Russell Alexander Spitzer
>Priority: Critical
>
> Currently PrunedFilterScan allows implementors to push down predicates to an 
> underlying datasource. This is done solely as an optimization as the 
> predicate will be reapplied on the Spark side as well. This allows for 
> bloom-filter like operations but ends up doing a redundant scan for those 
> sources which can do accurate pushdowns.
> In addition it makes it difficult for underlying sources to accept queries 
> which reference non-existent columns in order to provide ancillary 
> functionality. In our case we allow a solr query to be passed in via a 
> non-existent solr_query column. Since this column is not returned, nothing 
> passes when Spark re-applies the filter on "solr_query". 
> Suggestion on the ML from [~marmbrus] 
> {quote}
> We have to try and maintain binary compatibility here, so probably the 
> easiest thing to do here would be to add a method to the class.  Perhaps 
> something like:
> def unhandledFilters(filters: Array[Filter]): Array[Filter] = filters
> By default, this could return all filters so behavior would remain the same, 
> but specific implementations could override it.  There is still a chance that 
> this would conflict with existing methods, but hopefully that would not be a 
> problem in practice.
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6001) K-Means clusterer should return the assignments of input points to clusters

2015-11-01 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14984574#comment-14984574
 ] 

Yu Ishikawa commented on SPARK-6001:


[~josephkb] can we close this issue? 

> K-Means clusterer should return the assignments of input points to clusters
> ---
>
> Key: SPARK-6001
> URL: https://issues.apache.org/jira/browse/SPARK-6001
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.2.1
>Reporter: Derrick Burns
>Priority: Minor
>
> The K-Means clusterer returns a KMeansModel that contains the cluster 
> centers. However, when available, I suggest that the K-Means clusterer also 
> return an RDD of the assignments of the input data to the clusters. While the 
> assignments can be computed given the KMeansModel, why not return assignments 
> if they are available to save re-computation costs.
> The K-means implementation at 
> https://github.com/derrickburns/generalized-kmeans-clustering returns the 
> assignments when available.
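
For reference, a minimal sketch of the workaround the description alludes to, 
recomputing assignments from the returned KMeansModel with predict() (the input 
path and the k/maxIterations values are placeholders; sc is the shell's SparkContext):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Parse whitespace-separated feature vectors and cache them for reuse.
val points = sc.textFile("data/points.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()
val model = KMeans.train(points, 3, 20)  // k = 3, maxIterations = 20
// Extra pass over the input to recover the cluster assignment of each point.
val assignments = points.map(p => (p, model.predict(p)))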



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11275) [SQL] Regression in rollup/cube

2015-11-01 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14984539#comment-14984539
 ] 

Xiao Li commented on SPARK-11275:
-

A simple fix using a subquery can resolve this issue, so I am trying to modify 
the analyzer accordingly.

> [SQL] Regression in rollup/cube 
> 
>
> Key: SPARK-11275
> URL: https://issues.apache.org/jira/browse/SPARK-11275
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Xiao Li
>
> Spark SQL is unable to generate a correct result when running the following 
> query using rollup. 
> "select a, b, sum(a + b) as sumAB, GROUPING__ID from mytable group by a, 
> b with rollup"
> Spark SQL generates a wrong result:
> [2,4,6,3]
> [2,null,null,1]
> [1,null,null,1]
> [null,null,null,0]
> [1,2,3,3]
> The table mytable is super simple, containing two rows and two columns:
> testData = Seq((1, 2), (2, 4)).toDF("a", "b")
> After turning off codegen, the query plan is like 
> == Parsed Logical Plan ==
> 'Rollup ['a,'b], 
> [unresolvedalias('a),unresolvedalias('b),unresolvedalias('sum(('a + 'b)) AS 
> sumAB#20),unresolvedalias('GROUPING__ID)]
>  'UnresolvedRelation `mytable`, None
> == Analyzed Logical Plan ==
> a: int, b: int, sumAB: bigint, GROUPING__ID: int
> Aggregate [a#2,b#3,grouping__id#23], [a#2,b#3,sum(cast((a#2 + b#3) as 
> bigint)) AS sumAB#20L,GROUPING__ID#23]
>  Expand [0,1,3], [a#2,b#3], grouping__id#23
>   Subquery mytable
>Project [_1#0 AS a#2,_2#1 AS b#3]
> LocalRelation [_1#0,_2#1], [[1,2],[2,4]]
> == Optimized Logical Plan ==
> Aggregate [a#2,b#3,grouping__id#23], [a#2,b#3,sum(cast((a#2 + b#3) as 
> bigint)) AS sumAB#20L,GROUPING__ID#23]
>  Expand [0,1,3], [a#2,b#3], grouping__id#23
>   LocalRelation [a#2,b#3], [[1,2],[2,4]]
> == Physical Plan ==
> Aggregate false, [a#2,b#3,grouping__id#23], [a#2,b#3,sum(PartialSum#24L) AS 
> sumAB#20L,grouping__id#23]
>  Exchange hashpartitioning(a#2,b#3,grouping__id#23,5)
>   Aggregate true, [a#2,b#3,grouping__id#23], 
> [a#2,b#3,grouping__id#23,sum(cast((a#2 + b#3) as bigint)) AS PartialSum#24L]
>Expand [List(null, null, 0),List(a#2, null, 1),List(a#2, b#3, 3)], 
> [a#2,b#3,grouping__id#23]
> LocalTableScan [a#2,b#3], [[1,2],[2,4]]
> Below are my observations:
> 1. Generation of GROUPING__ID looks OK. 
> 2. The problem still exists no matter whether turning on/off CODEGEN
> 3. Rollup still works in a simple query when group-by columns have only one 
> column. For example, "select b, sum(a), GROUPING__ID from mytable group by b 
> with rollup"
> 4. The buckets in "HiveDataFrameAnalyticsSuite" are misleading. 
> Unfortunately, they hide the bugs. Although the buckets passed, they just 
> compare the results of SQL and DataFrame. This approach is unable to capture 
> the regression when both return the same wrong results.  
> 5. The same problem also exists in cube. I have not started the investigation 
> in cube, but I believe the root causes should be the same. 
> 6. It looks like all the logical plans are correct.
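
For anyone trying to reproduce this in the spark shell, a minimal sketch (it assumes 
sqlContext is a HiveContext so GROUPING__ID resolves; the expected sums are noted in 
the trailing comment):

import sqlContext.implicits._

// Two-row table from the description above.
val testData = Seq((1, 2), (2, 4)).toDF("a", "b")
testData.registerTempTable("mytable")
sqlContext.sql(
  "select a, b, sum(a + b) as sumAB, GROUPING__ID from mytable group by a, b with rollup"
).show()
// Expected sums: (1,2) -> 3, (2,4) -> 6, per-a rollups -> 3 and 6, grand total -> 9.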



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11442) Reduce numSlices for local metrics test of SparkListenerSuite

2015-11-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11442:


Assignee: Apache Spark

> Reduce numSlices for local metrics test of SparkListenerSuite
> -
>
> Key: SPARK-11442
> URL: https://issues.apache.org/jira/browse/SPARK-11442
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Reporter: Ted Yu
>Assignee: Apache Spark
>Priority: Minor
>
> In the thread, 
> http://search-hadoop.com/m/q3RTtcQiFSlTxeP/test+failed+due+to+OOME=test+failed+due+to+OOME,
>  it was discussed that memory consumption for SparkListenerSuite should be 
> brought down.
> This is an attempt in that direction by reducing numSlices for local metrics 
> test.
> Before change:
> Run completed in 57 seconds, 357 milliseconds.
> Reducing numSlices to 16 results in:
> Run completed in 44 seconds, 115 milliseconds.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11442) Reduce numSlices for local metrics test of SparkListenerSuite

2015-11-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14984553#comment-14984553
 ] 

Apache Spark commented on SPARK-11442:
--

User 'tedyu' has created a pull request for this issue:
https://github.com/apache/spark/pull/9384

> Reduce numSlices for local metrics test of SparkListenerSuite
> -
>
> Key: SPARK-11442
> URL: https://issues.apache.org/jira/browse/SPARK-11442
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Reporter: Ted Yu
>Priority: Minor
>
> In the thread, 
> http://search-hadoop.com/m/q3RTtcQiFSlTxeP/test+failed+due+to+OOME=test+failed+due+to+OOME,
>  it was discussed that memory consumption for SparkListenerSuite should be 
> brought down.
> This is an attempt in that direction by reducing numSlices for local metrics 
> test.
> Before change:
> Run completed in 57 seconds, 357 milliseconds.
> Reducing numSlices to 16 results in:
> Run completed in 44 seconds, 115 milliseconds.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11442) Reduce numSlices for local metrics test of SparkListenerSuite

2015-11-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11442:


Assignee: (was: Apache Spark)

> Reduce numSlices for local metrics test of SparkListenerSuite
> -
>
> Key: SPARK-11442
> URL: https://issues.apache.org/jira/browse/SPARK-11442
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Reporter: Ted Yu
>Priority: Minor
>
> In the thread, 
> http://search-hadoop.com/m/q3RTtcQiFSlTxeP/test+failed+due+to+OOME=test+failed+due+to+OOME,
>  it was discussed that memory consumption for SparkListenerSuite should be 
> brought down.
> This is an attempt in that direction by reducing numSlices for local metrics 
> test.
> Before change:
> Run completed in 57 seconds, 357 milliseconds.
> Reducing numSlices to 16 results in:
> Run completed in 44 seconds, 115 milliseconds.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11442) Reduce numSlices for local metrics test of SparkListenerSuite

2015-11-01 Thread Ted Yu (JIRA)
Ted Yu created SPARK-11442:
--

 Summary: Reduce numSlices for local metrics test of 
SparkListenerSuite
 Key: SPARK-11442
 URL: https://issues.apache.org/jira/browse/SPARK-11442
 Project: Spark
  Issue Type: Test
  Components: Tests
Reporter: Ted Yu
Priority: Minor


In the thread, 
http://search-hadoop.com/m/q3RTtcQiFSlTxeP/test+failed+due+to+OOME=test+failed+due+to+OOME,
 it was discussed that memory consumption for SparkListenerSuite should be 
brought down.

This is an attempt in that direction by reducing numSlices for local metrics 
test.

Before change:

Run completed in 57 seconds, 357 milliseconds.

Reducing numSlices to 16 results in:

Run completed in 44 seconds, 115 milliseconds.
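
Not the actual test code, just the shape of the change: the local-metrics test 
parallelizes its data with an explicit slice count, so lowering that count is the 
whole patch (values below are illustrative):

// Fewer slices means fewer tasks tracked by the listener, which the linked
// thread suggests lowers the suite's memory footprint.
val rdd = sc.parallelize(1 to 100000, numSlices = 16)  // previously a larger slice count
rdd.count()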



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11443) Blank line reserved include_example

2015-11-01 Thread Xusen Yin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xusen Yin updated SPARK-11443:
--
Description: The trim_codeblock(lines) function in include_example.rb 
removes some blank lines in the code.

> Blank line reserved include_example
> ---
>
> Key: SPARK-11443
> URL: https://issues.apache.org/jira/browse/SPARK-11443
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Xusen Yin
>
> The trim_codeblock(lines) function in include_example.rb removes some blank 
> lines in the code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11443) Blank line reserved include_example

2015-11-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11443:


Assignee: Apache Spark

> Blank line reserved include_example
> ---
>
> Key: SPARK-11443
> URL: https://issues.apache.org/jira/browse/SPARK-11443
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Xusen Yin
>Assignee: Apache Spark
>
> The trim_codeblock(lines) function in include_example.rb removes some blank 
> lines in the code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11443) Blank line reserved include_example

2015-11-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14984567#comment-14984567
 ] 

Apache Spark commented on SPARK-11443:
--

User 'yinxusen' has created a pull request for this issue:
https://github.com/apache/spark/pull/9400

> Blank line reserved include_example
> ---
>
> Key: SPARK-11443
> URL: https://issues.apache.org/jira/browse/SPARK-11443
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Xusen Yin
>
> The trim_codeblock(lines) function in include_example.rb removes some blank 
> lines in the code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9722) Pass random seed to spark.ml DecisionTree*

2015-11-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9722:
---

Assignee: (was: Apache Spark)

> Pass random seed to spark.ml DecisionTree*
> --
>
> Key: SPARK-9722
> URL: https://issues.apache.org/jira/browse/SPARK-9722
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Trivial
>
> Trees use XORShiftRandom when binning continuous features.  Currently, they 
> use a fixed seed of 1.  They should accept a random seed param and use that 
> instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9722) Pass random seed to spark.ml DecisionTree*

2015-11-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9722:
---

Assignee: Apache Spark

> Pass random seed to spark.ml DecisionTree*
> --
>
> Key: SPARK-9722
> URL: https://issues.apache.org/jira/browse/SPARK-9722
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>Priority: Trivial
>
> Trees use XORShiftRandom when binning continuous features.  Currently, they 
> use a fixed seed of 1.  They should accept a random seed param and use that 
> instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9722) Pass random seed to spark.ml DecisionTree*

2015-11-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14984593#comment-14984593
 ] 

Apache Spark commented on SPARK-9722:
-

User 'yu-iskw' has created a pull request for this issue:
https://github.com/apache/spark/pull/9402

> Pass random seed to spark.ml DecisionTree*
> --
>
> Key: SPARK-9722
> URL: https://issues.apache.org/jira/browse/SPARK-9722
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Trivial
>
> Trees use XORShiftRandom when binning continuous features.  Currently, they 
> use a fixed seed of 1.  They should accept a random seed param and use that 
> instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10474) TungstenAggregation cannot acquire memory for pointer array after switching to sort-based

2015-11-01 Thread Naden Franciscus (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14984614#comment-14984614
 ] 

Naden Franciscus commented on SPARK-10474:
--

Can confirm this issue has been resolved. Nice work.

> TungstenAggregation cannot acquire memory for pointer array after switching 
> to sort-based
> -
>
> Key: SPARK-10474
> URL: https://issues.apache.org/jira/browse/SPARK-10474
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Assignee: Andrew Or
>Priority: Blocker
> Fix For: 1.5.1, 1.6.0
>
>
> In an aggregation case, a lost task happened with the error below.
> {code}
>  java.io.IOException: Could not acquire 65536 bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220)
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:126)
> at 
> org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Key SQL Query
> {code:sql}
> INSERT INTO TABLE test_table
> SELECT
>   ss.ss_customer_sk AS cid,
>   count(CASE WHEN i.i_class_id=1  THEN 1 ELSE NULL END) AS id1,
>   count(CASE WHEN i.i_class_id=3  THEN 1 ELSE NULL END) AS id3,
>   count(CASE WHEN i.i_class_id=5  THEN 1 ELSE NULL END) AS id5,
>   count(CASE WHEN i.i_class_id=7  THEN 1 ELSE NULL END) AS id7,
>   count(CASE WHEN i.i_class_id=9  THEN 1 ELSE NULL END) AS id9,
>   count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11,
>   count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13,
>   count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15,
>   count(CASE WHEN i.i_class_id=2  THEN 1 ELSE NULL END) AS id2,
>   count(CASE WHEN i.i_class_id=4  THEN 1 ELSE NULL END) AS id4,
>   count(CASE WHEN i.i_class_id=6  THEN 1 ELSE NULL END) AS id6,
>   count(CASE WHEN i.i_class_id=8  THEN 1 ELSE NULL END) AS id8,
>   count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) AS id10,
>   count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14,
>   count(CASE WHEN i.i_class_id=16 THEN 1 ELSE NULL END) AS id16
> FROM store_sales ss
> INNER JOIN item i ON ss.ss_item_sk = i.i_item_sk
> WHERE i.i_category IN ('Books')
> AND ss.ss_customer_sk IS NOT NULL
> GROUP BY ss.ss_customer_sk
> HAVING count(ss.ss_item_sk) > 5
> {code}
> Note:
> store_sales is a big fact table and item is a small dimension table.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-

[jira] [Commented] (SPARK-11446) Spark 1.6 release notes

2015-11-01 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14984776#comment-14984776
 ] 

Patrick Wendell commented on SPARK-11446:
-

I think this is redundant with the "releasenotes" tag so I am closing it.

> Spark 1.6 release notes
> ---
>
> Key: SPARK-11446
> URL: https://issues.apache.org/jira/browse/SPARK-11446
> Project: Spark
>  Issue Type: Task
>  Components: Documentation
>Reporter: Patrick Wendell
>Assignee: Michael Armbrust
>Priority: Critical
>
> This is a staging location where we can keep track of changes that need to be 
> documented in the release notes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-11446) Spark 1.6 release notes

2015-11-01 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell closed SPARK-11446.
---
Resolution: Invalid

> Spark 1.6 release notes
> ---
>
> Key: SPARK-11446
> URL: https://issues.apache.org/jira/browse/SPARK-11446
> Project: Spark
>  Issue Type: Task
>  Components: Documentation
>Reporter: Patrick Wendell
>Assignee: Michael Armbrust
>Priority: Critical
>
> This is a staging location where we can keep track of changes that need to be 
> documented in the release notes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5354) Set InMemoryColumnarTableScan's outputPartitioning and outputOrdering

2015-11-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14984753#comment-14984753
 ] 

Apache Spark commented on SPARK-5354:
-

User 'nongli' has created a pull request for this issue:
https://github.com/apache/spark/pull/9404

> Set InMemoryColumnarTableScan's outputPartitioning and outputOrdering
> -
>
> Key: SPARK-5354
> URL: https://issues.apache.org/jira/browse/SPARK-5354
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yin Huai
>
> *Updated*
> Right now, Spark SQL is not aware of the outputPartitioning and 
> outputOrdering of a InMemoryColumnarTableScan. Actually, we can just inherit 
> these two properties from the {{SparkPlan}} of the cached table.
> *Original*
> Right now, Spark SQL is not aware of the partitioning scheme of a leaf 
> SparkPlan (e.g. an input table). So, even if users re-partition the data in 
> advance, Exchange operators will still be used.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-5354) Set InMemoryColumnarTableScan's outputPartitioning and outputOrdering

2015-11-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-5354:
---

Assignee: Apache Spark

> Set InMemoryColumnarTableScan's outputPartitioning and outputOrdering
> -
>
> Key: SPARK-5354
> URL: https://issues.apache.org/jira/browse/SPARK-5354
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Apache Spark
>
> *Updated*
> Right now, Spark SQL is not aware of the outputPartitioning and 
> outputOrdering of a InMemoryColumnarTableScan. Actually, we can just inherit 
> these two properties from the {{SparkPlan}} of the cached table.
> *Original*
> Right now, Spark SQL is not aware of the partitioning scheme of a leaf 
> SparkPlan (e.g. an input table). So, even if users re-partition the data in 
> advance, Exchange operators will still be used.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-5354) Set InMemoryColumnarTableScan's outputPartitioning and outputOrdering

2015-11-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-5354:
---

Assignee: (was: Apache Spark)

> Set InMemoryColumnarTableScan's outputPartitioning and outputOrdering
> -
>
> Key: SPARK-5354
> URL: https://issues.apache.org/jira/browse/SPARK-5354
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yin Huai
>
> *Updated*
> Right now, Spark SQL is not aware of the outputPartitioning and 
> outputOrdering of a InMemoryColumnarTableScan. Actually, we can just inherit 
> these two properties from the {{SparkPlan}} of the cached table.
> *Original*
> Right now, Spark SQL is not aware of the partitioning scheme of a leaf 
> SparkPlan (e.g. an input table). So, even if users re-partition the data in 
> advance, Exchange operators will still be used.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5575) Artificial neural networks for MLlib deep learning

2015-11-01 Thread Guoqiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14984809#comment-14984809
 ] 

Guoqiang Li edited comment on SPARK-5575 at 11/2/15 6:59 AM:
-

Hi All,
There is an MLP implementation with the Parameter Server and Spark.

[MLP.scala|https://github.com/witgo/zen/blob/ps_mlp/ml/src/main/scala/com/github/cloudml/zen/ml/parameterserver/neuralNetwork/MLP.scala]



was (Author: gq):
Hi All,
There is a MLP implementation with the Parameter Server and Spark.

> Artificial neural networks for MLlib deep learning
> --
>
> Key: SPARK-5575
> URL: https://issues.apache.org/jira/browse/SPARK-5575
> Project: Spark
>  Issue Type: Umbrella
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Alexander Ulanov
>
> Goal: Implement various types of artificial neural networks
> Motivation: deep learning trend
> Requirements: 
> 1) Basic abstractions such as Neuron, Layer, Error, Regularization, Forward 
> and Backpropagation etc. should be implemented as traits or interfaces, so 
> they can be easily extended or reused
> 2) Implement complex abstractions, such as feed forward and recurrent networks
> 3) Implement multilayer perceptron (MLP), convolutional networks (LeNet), 
> autoencoder (sparse and denoising), stacked autoencoder, restricted  
> boltzmann machines (RBM), deep belief networks (DBN) etc.
> 4) Implement or reuse supporting constructs, such as classifiers, normalizers, 
> poolers, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10597) MultivariateOnlineSummarizer for weighted instances

2015-11-01 Thread DB Tsai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-10597:

Target Version/s:   (was: 1.6.0)

> MultivariateOnlineSummarizer for weighted instances
> ---
>
> Key: SPARK-10597
> URL: https://issues.apache.org/jira/browse/SPARK-10597
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.5.0
>Reporter: DB Tsai
>Assignee: DB Tsai
>
> MultivariateOnlineSummarizer for weighted instances is implemented as private 
> API for SPARK-7685.
> In SPARK-7685, the online, numerically stable version of the unbiased variance 
> estimate defined by the reliability weights 
> ([[https://en.wikipedia.org/wiki/Weighted_arithmetic_mean#Reliability_weights]])
> is implemented, but we would like to make it a public API since there are 
> different use-cases.
> Currently, `count` will return the actual number of instances, and ignores 
> instance weights, but `numNonzeros` will return the weighted # of nonzeros. 
> We need to decide the behavior of them before making it public.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11275) [SQL] Regression in rollup/cube

2015-11-01 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14984814#comment-14984814
 ] 

Xiao Li commented on SPARK-11275:
-

My fix is ready. Now, trying to add the test cases. Hopefully, I can finish all 
of them tomorrow. 

> [SQL] Regression in rollup/cube 
> 
>
> Key: SPARK-11275
> URL: https://issues.apache.org/jira/browse/SPARK-11275
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Xiao Li
>
> Spark SQL is unable to generate a correct result when running the following 
> query using rollup. 
> "select a, b, sum(a + b) as sumAB, GROUPING__ID from mytable group by a, 
> b with rollup"
> Spark SQL generates a wrong result:
> [2,4,6,3]
> [2,null,null,1]
> [1,null,null,1]
> [null,null,null,0]
> [1,2,3,3]
> The table mytable is super simple, containing two rows and two columns:
> testData = Seq((1, 2), (2, 4)).toDF("a", "b")
> After turning off codegen, the query plan is like 
> == Parsed Logical Plan ==
> 'Rollup ['a,'b], 
> [unresolvedalias('a),unresolvedalias('b),unresolvedalias('sum(('a + 'b)) AS 
> sumAB#20),unresolvedalias('GROUPING__ID)]
>  'UnresolvedRelation `mytable`, None
> == Analyzed Logical Plan ==
> a: int, b: int, sumAB: bigint, GROUPING__ID: int
> Aggregate [a#2,b#3,grouping__id#23], [a#2,b#3,sum(cast((a#2 + b#3) as 
> bigint)) AS sumAB#20L,GROUPING__ID#23]
>  Expand [0,1,3], [a#2,b#3], grouping__id#23
>   Subquery mytable
>Project [_1#0 AS a#2,_2#1 AS b#3]
> LocalRelation [_1#0,_2#1], [[1,2],[2,4]]
> == Optimized Logical Plan ==
> Aggregate [a#2,b#3,grouping__id#23], [a#2,b#3,sum(cast((a#2 + b#3) as 
> bigint)) AS sumAB#20L,GROUPING__ID#23]
>  Expand [0,1,3], [a#2,b#3], grouping__id#23
>   LocalRelation [a#2,b#3], [[1,2],[2,4]]
> == Physical Plan ==
> Aggregate false, [a#2,b#3,grouping__id#23], [a#2,b#3,sum(PartialSum#24L) AS 
> sumAB#20L,grouping__id#23]
>  Exchange hashpartitioning(a#2,b#3,grouping__id#23,5)
>   Aggregate true, [a#2,b#3,grouping__id#23], 
> [a#2,b#3,grouping__id#23,sum(cast((a#2 + b#3) as bigint)) AS PartialSum#24L]
>Expand [List(null, null, 0),List(a#2, null, 1),List(a#2, b#3, 3)], 
> [a#2,b#3,grouping__id#23]
> LocalTableScan [a#2,b#3], [[1,2],[2,4]]
> Below are my observations:
> 1. Generation of GROUPING__ID looks OK. 
> 2. The problem still exists no matter whether turning on/off CODEGEN
> 3. Rollup still works in a simple query when group-by columns have only one 
> column. For example, "select b, sum(a), GROUPING__ID from mytable group by b 
> with rollup"
> 4. The buckets in "HiveDataFrameAnalyticsSuite" are misleading. 
> Unfortunately, they hide the bugs. Although the buckets passed, they just 
> compare the results of SQL and DataFrame. This approach is unable to capture 
> the regression when both return the same wrong results.  
> 5. The same problem also exists in cube. I have not started the investigation 
> in cube, but I believe the root causes should be the same. 
> 6. It looks like all the logical plans are correct.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9722) Pass random seed to spark.ml DecisionTree*

2015-11-01 Thread DB Tsai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai resolved SPARK-9722.

   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9402
[https://github.com/apache/spark/pull/9402]

> Pass random seed to spark.ml DecisionTree*
> --
>
> Key: SPARK-9722
> URL: https://issues.apache.org/jira/browse/SPARK-9722
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Yu Ishikawa
>Priority: Trivial
> Fix For: 1.6.0
>
>
> Trees use XORShiftRandom when binning continuous features.  Currently, they 
> use a fixed seed of 1.  They should accept a random seed param and use that 
> instead.
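
With the fix, usage would look roughly like this (a hedged sketch assuming the usual 
spark.ml setSeed param setter added by the linked pull request):

import org.apache.spark.ml.classification.DecisionTreeClassifier

// The continuous-feature binning RNG now honors the supplied seed instead of the fixed 1.
val dt = new DecisionTreeClassifier().setSeed(42L)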



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11340) Support setting driver properties when starting Spark from R programmatically or from RStudio

2015-11-01 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14984778#comment-14984778
 ] 

Sun Rui commented on SPARK-11340:
-

I have posted a message about this enhancement on the mailing list for the user 
who reported this issue.

> Support setting driver properties when starting Spark from R programmatically 
> or from RStudio
> -
>
> Key: SPARK-11340
> URL: https://issues.apache.org/jira/browse/SPARK-11340
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.1
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>Priority: Minor
> Fix For: 1.6.0
>
>
> Currently when sparkR.init() is called in 'client' mode, it launches the JVM 
> backend but driver properties (like driver-memory) are not passed or settable 
> by the user calling sparkR.init().
> [~sunrui][~shivaram] and I discussed this offline and think we should support 
> this.
> This is the original thread:
> >> From: rui@intel.com
> >> To: dirceu.semigh...@gmail.com
> >> CC: u...@spark.apache.org
> >> Subject: RE: How to set memory for SparkR with master="local[*]"
> >> Date: Mon, 26 Oct 2015 02:24:00 +
> >>
> >> As documented in
> >> http://spark.apache.org/docs/latest/configuration.html#available-prop
> >> e
> >> rties,
> >>
> >> Note for “spark.driver.memory”:
> >>
> >> Note: In client mode, this config must not be set through the 
> >> SparkConf directly in your application, because the driver JVM has 
> >> already started at that point. Instead, please set this through the 
> >> --driver-memory command line option or in your default properties file.
> >>
> >>
> >>
> >> If you are to start a SparkR shell using bin/sparkR, then you can use 
> >> bin/sparkR –driver-memory. You have no chance to set the driver 
> >> memory size after the R shell has been launched via bin/sparkR.
> >>
> >>
> >>
> >> But if you are to start a SparkR shell manually without using 
> >> bin/sparkR (for example, in Rstudio), you can:
> >>
> >> library(SparkR)
> >>
> >> Sys.setenv("SPARKR_SUBMIT_ARGS" = "--conf spark.driver.memory=2g
> >> sparkr-shell")
> >>
> >> sc <- sparkR.init()
> >>
> >>
> >>
> >> From: Dirceu Semighini Filho [mailto:dirceu.semigh...@gmail.com]
> >> Sent: Friday, October 23, 2015 7:53 PM
> >> Cc: user
> >> Subject: Re: How to set memory for SparkR with master="local[*]"
> >>
> >>
> >>
> >> Hi Matej,
> >>
> >> I'm also using this and I'm seeing the same behavior here; my driver 
> >> has only 530 MB, which is the default value.
> >>
> >>
> >>
> >> Maybe this is a bug.
> >>
> >>
> >>
> >> 2015-10-23 9:43 GMT-02:00 Matej Holec :
> >>
> >> Hello!
> >>
> >> How to adjust the memory settings properly for SparkR with 
> >> master="local[*]"
> >> in R?
> >>
> >>
> >> *When running from  R -- SparkR doesn't accept memory settings :(*
> >>
> >> I use the following commands:
> >>
> >> R>  library(SparkR)
> >> R>  sc <- sparkR.init(master = "local[*]", sparkEnvir =
> >> list(spark.driver.memory = "5g"))
> >>
> >> Although the variable spark.driver.memory is correctly set (checked at 
> >> http://node:4040/environment/), the driver has only the default 
> >> amount of memory allocated (Storage Memory 530.3 MB).
> >>
> >> *But when running from  spark-1.5.1-bin-hadoop2.6/bin/sparkR -- OK*
> >>
> >> The following command:
> >>
> >> ]$ spark-1.5.1-bin-hadoop2.6/bin/sparkR --driver-memory 5g
> >>
> >> creates a SparkR session with properly adjusted driver memory (Storage 
> >> Memory 2.6 GB).
> >>
> >>
> >> Any suggestion?
> >>
> >> Thanks
> >> Matej
> >>
> >>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5575) Artificial neural networks for MLlib deep learning

2015-11-01 Thread Guoqiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14984809#comment-14984809
 ] 

Guoqiang Li commented on SPARK-5575:


Hi all,
There is an MLP implementation that uses a Parameter Server with Spark.

> Artificial neural networks for MLlib deep learning
> --
>
> Key: SPARK-5575
> URL: https://issues.apache.org/jira/browse/SPARK-5575
> Project: Spark
>  Issue Type: Umbrella
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Alexander Ulanov
>
> Goal: Implement various types of artificial neural networks
> Motivation: deep learning trend
> Requirements: 
> 1) Basic abstractions such as Neuron, Layer, Error, Regularization, Forward 
> and Backpropagation etc. should be implemented as traits or interfaces, so 
> they can be easily extended or reused
> 2) Implement complex abstractions, such as feed forward and recurrent networks
> 3) Implement multilayer perceptron (MLP), convolutional networks (LeNet), 
> autoencoder (sparse and denoising), stacked autoencoder, restricted 
> Boltzmann machines (RBM), deep belief networks (DBN), etc.
> 4) Implement or reuse supporting constructs, such as classifiers, normalizers, 
> poolers, etc.
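
As a rough illustration of requirement 1) above, the basic abstractions could be expressed as Scala traits along the following lines. This is a hypothetical sketch using Breeze types, not the API from the design discussion:

{code}
import breeze.linalg.{DenseMatrix => BDM, DenseVector => BDV}

// Hypothetical sketch of trait-based layer abstractions; names and
// signatures are illustrative only.
trait Layer {
  def forward(input: BDV[Double]): BDV[Double]
  def backward(input: BDV[Double], outputDelta: BDV[Double]): BDV[Double]
}

class AffineLayer(weights: BDM[Double], bias: BDV[Double]) extends Layer {
  // Forward pass: y = W * x + b
  override def forward(input: BDV[Double]): BDV[Double] =
    weights * input + bias

  // Backward pass: propagate the error to the previous layer, delta_prev = W^T * delta
  override def backward(input: BDV[Double], outputDelta: BDV[Double]): BDV[Double] =
    weights.t * outputDelta
}
{code}

Concrete networks (MLP, convolutional, recurrent) would then compose such layers, which is what makes the trait-based decomposition easy to extend and reuse.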



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9722) Pass random seed to spark.ml DecisionTree*

2015-11-01 Thread DB Tsai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-9722:
---
Assignee: Yu Ishikawa

> Pass random seed to spark.ml DecisionTree*
> --
>
> Key: SPARK-9722
> URL: https://issues.apache.org/jira/browse/SPARK-9722
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Yu Ishikawa
>Priority: Trivial
>
> Trees use XORShiftRandom when binning continuous features.  Currently, they 
> use a fixed seed of 1.  They should accept a random seed param and use that 
> instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10810) Improve session management for SQL

2015-11-01 Thread Rex Xiong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14984758#comment-14984758
 ] 

Rex Xiong commented on SPARK-10810:
---

Is it possible to port this change to 1.5.x?
It's a blocking issue for us in adopting 1.5.
Alternatively, when will branch-1.6 be created?

Thanks

> Improve session management for SQL
> --
>
> Key: SPARK-10810
> URL: https://issues.apache.org/jira/browse/SPARK-10810
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Blocker
> Fix For: 1.6.0
>
> Attachments: Session management in Spark SQL 1.6.pdf
>
>
> Currently, we try to support multiple sessions in SQL within a SparkContext, 
> but the support is broken and incomplete.
> We should isolate the following for each session:
> 1) current database of Hive
> 2) SQLConf
> 3) UDF/UDAF/UDTF
> 4) temporary table
> Added jars and cached tables, on the other hand, should remain accessible to all sessions.
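
As a rough sketch of the isolation described above, per-session state would be kept apart from state shared across sessions, along these lines (hypothetical names and types, not the design from the attached document):

{code}
import scala.collection.concurrent.TrieMap

// Hypothetical sketch; names and types are illustrative only.
case class SessionState(
  var currentDatabase: String = "default",            // 1) current Hive database
  sqlConf: TrieMap[String, String] = TrieMap.empty,    // 2) SQLConf
  functions: TrieMap[String, AnyRef] = TrieMap.empty,  // 3) UDF/UDAF/UDTF
  tempTables: TrieMap[String, AnyRef] = TrieMap.empty  // 4) temporary tables
)

object SharedState {
  // Added jars and cached tables stay visible to every session.
  val addedJars    = TrieMap.empty[String, Long]
  val cachedTables = TrieMap.empty[String, AnyRef]
}

val sessions = TrieMap.empty[String, SessionState]

// Each connection gets (or reuses) its own isolated state.
def openSession(sessionId: String): SessionState =
  sessions.getOrElseUpdate(sessionId, SessionState())
{code}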



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11191) [1.5] Can't create UDF's using hive thrift service

2015-11-01 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14984804#comment-14984804
 ] 

Wenchen Fan commented on SPARK-11191:
-

The behaviour of "ADD JAR" has not changed from 1.4 to 1.5, and "CREATE 
FUNCTION" is a native command that we run using the Hive client. So I don't 
think the change to the built-in FunctionRegistry is the cause of this bug; 
I'll look into it.
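
For reference, the function registered in the repro quoted below is an ordinary Hive UDF. A minimal sketch of such a class follows; the {{com.foo.class.UDF}} name in the report is a placeholder, as are the package and class names here:

{code}
package com.example.udfs  // placeholder, standing in for the report's com.foo.class.UDF

import org.apache.hadoop.hive.ql.exec.UDF

// Minimal old-style Hive UDF; Hive resolves the evaluate method via reflection.
class TestUDF extends UDF {
  def evaluate(input: String): String =
    if (input == null) null else input.toUpperCase
}
{code}

Packaged into a jar, this is what {{add jar}} puts on the classpath and what {{CREATE TEMPORARY FUNCTION testUDF AS ...}} binds the function name to.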

> [1.5] Can't create UDF's using hive thrift service
> --
>
> Key: SPARK-11191
> URL: https://issues.apache.org/jira/browse/SPARK-11191
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
>Reporter: David Ross
>Priority: Blocker
>
> Since upgrading to Spark 1.5, we've been unable to create and use UDFs when 
> we run in thrift server mode.
> Our setup:
> We start the thrift server running against YARN in client mode (we've also 
> built our own Spark from the GitHub branch-1.5 with the following args: {{-Pyarn 
> -Phive -Phive-thriftserver}}).
> If I run the following after connecting via JDBC (in this case via beeline):
> {{add jar "hdfs://path/to/jar"}}
> (this command succeeds with no errors)
> {{CREATE TEMPORARY FUNCTION testUDF AS 'com.foo.class.UDF';}}
> (this command succeeds with no errors)
> {{select testUDF(col1) from table1;}}
> I get the following error in the logs:
> {code}
> org.apache.spark.sql.AnalysisException: undefined function testUDF; line 1 
> pos 8
> at 
> org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2$$anonfun$1.apply(hiveUDFs.scala:58)
> at 
> org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2$$anonfun$1.apply(hiveUDFs.scala:58)
> at scala.Option.getOrElse(Option.scala:120)
> at 
> org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2.apply(hiveUDFs.scala:57)
> at 
> org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2.apply(hiveUDFs.scala:53)
> at scala.util.Try.getOrElse(Try.scala:77)
> at 
> org.apache.spark.sql.hive.HiveFunctionRegistry.lookupFunction(hiveUDFs.scala:53)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5$$anonfun$applyOrElse$24.apply(Analyzer.scala:506)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5$$anonfun$applyOrElse$24.apply(Analyzer.scala:506)
> at 
> org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:505)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:502)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:226)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249)
> {code}
> (cutting the bulk for ease of report, more than happy to send the full output)
> {code}
> 15/10/12 14:34:37 ERROR SparkExecuteStatementOperation: Error running hive 
> query:
> org.apache.hive.service.cli.HiveSQLException: 
> org.apache.spark.sql.AnalysisException: undefined function testUDF; line 1 
> pos 100
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.runInternal(SparkExecuteStatementOperation.scala:259)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:171)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:182)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> 

[jira] [Commented] (SPARK-5575) Artificial neural networks for MLlib deep learning

2015-11-01 Thread zhengbing li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14984853#comment-14984853
 ] 

zhengbing li commented on SPARK-5575:
-

Hi Guoqiang,
Which Parameter Server do you use? Open source or private?

> Artificial neural networks for MLlib deep learning
> --
>
> Key: SPARK-5575
> URL: https://issues.apache.org/jira/browse/SPARK-5575
> Project: Spark
>  Issue Type: Umbrella
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Alexander Ulanov
>
> Goal: Implement various types of artificial neural networks
> Motivation: deep learning trend
> Requirements: 
> 1) Basic abstractions such as Neuron, Layer, Error, Regularization, Forward 
> and Backpropagation etc. should be implemented as traits or interfaces, so 
> they can be easily extended or reused
> 2) Implement complex abstractions, such as feed forward and recurrent networks
> 3) Implement multilayer perceptron (MLP), convolutional networks (LeNet), 
> autoencoder (sparse and denoising), stacked autoencoder, restricted 
> Boltzmann machines (RBM), deep belief networks (DBN), etc.
> 4) Implement or reuse supporting constructs, such as classifiers, normalizers, 
> poolers, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org