[jira] [Commented] (SPARK-11595) "ADD JAR" doesn't work if the given path contains URL scheme like "file:/" and "hdfs:/"

2015-11-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14996611#comment-14996611
 ] 

Apache Spark commented on SPARK-11595:
--

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/9570

> "ADD JAR" doesn't work if the given path contains URL scheme like "file:/" 
> and "hdfs:/"
> ---
>
> Key: SPARK-11595
> URL: https://issues.apache.org/jira/browse/SPARK-11595
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Blocker
>
> When handling {{ADD JAR}}, Spark constructs a {{java.io.File}} first using 
> the input jar path, and then converts it into a URL 
> ([here|https://github.com/apache/spark/pull/8909/files#diff-d613c921507243c65591c003a348f5f3R180]).
>  This works fine for local file paths without a URL scheme (e.g. 
> {{/tmp/a.jar}}). However, {{java.io.File.toURI}} returns an unexpected result 
> when given a path containing a URL scheme (e.g. {{file:///tmp/a.jar}} or 
> {{hdfs://host:9000/path/to/a.jar}}):
> {noformat}
> scala> new java.io.File("file:///tmp/file").toURI
> res1: java.net.URI = 
> file:/Users/lian/local/src/spark/workspace-a/file:/tmp/file
> {noformat}
> The consequence is that, although the {{ADD JAR}} command doesn't fail 
> immediately, the jar is actually not added properly.






[jira] [Commented] (SPARK-7334) Implement RandomProjection for Dimensionality Reduction

2015-11-09 Thread Sebastian Alfers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14996470#comment-14996470
 ] 

Sebastian Alfers commented on SPARK-7334:
-

Is this still relevant? [~josephkb] 

I saw a discussion about LSH here: 
https://issues.apache.org/jira/browse/SPARK-5992

> Implement RandomProjection for Dimensionality Reduction
> ---
>
> Key: SPARK-7334
> URL: https://issues.apache.org/jira/browse/SPARK-7334
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Sebastian Alfers
>Priority: Minor
>
> Implement RandomProjection (RP) for dimensionality reduction
> RP is a popular approach for reducing the dimensionality of your data while 
> preserving a reasonable amount of its information (pairwise distances) [1][2]:
> - [1] http://www.yaroslavvb.com/papers/achlioptas-database.pdf
> - [2] 
> http://people.inf.elte.hu/fekete/algoritmusok_msc/dimenzio_csokkentes/randon_projection_kdd.pdf
> I compared different implementations of that algorithm:
> - https://github.com/sebastian-alfers/random-projection-python
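For illustration, a dense Gaussian random projection is only a few lines. This is a minimal sketch of the textbook construction operating on plain arrays (an assumed toy setup, not an MLlib API):

{code}
import scala.util.Random

// Project d-dimensional points into k dimensions using a random matrix whose
// entries are drawn from N(0, 1/k); by the Johnson-Lindenstrauss lemma,
// pairwise distances are approximately preserved with high probability.
def randomProjection(data: Array[Array[Double]], k: Int, seed: Long = 42L): Array[Array[Double]] = {
  val d = data.head.length
  val rng = new Random(seed)
  val r = Array.fill(k, d)(rng.nextGaussian() / math.sqrt(k))                // k x d projection matrix
  data.map(x => r.map(row => row.zip(x).map { case (a, b) => a * b }.sum))   // R * x for each point
}
{code}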






[jira] [Assigned] (SPARK-11595) "ADD JAR" doesn't work if the given path contains URL scheme like "file:/" and "hdfs:/"

2015-11-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11595:


Assignee: Apache Spark  (was: Cheng Lian)

> "ADD JAR" doesn't work if the given path contains URL scheme like "file:/" 
> and "hdfs:/"
> ---
>
> Key: SPARK-11595
> URL: https://issues.apache.org/jira/browse/SPARK-11595
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Cheng Lian
>Assignee: Apache Spark
>Priority: Blocker
>
> When handling {{ADD JAR}}, PR #8909 constructs a {{java.io.File}} first using 
> the input jar path, and then converts it into a URL 
> ([here|https://github.com/apache/spark/pull/8909/files#diff-d613c921507243c65591c003a348f5f3R180]).
>  This works fine for local file paths without a URL scheme (e.g. 
> {{/tmp/a.jar}}). However, {{java.io.File.toURI}} returns an unexpected result 
> when given a path containing a URL scheme (e.g. {{file:///tmp/a.jar}} or 
> {{hdfs://host:9000/path/to/a.jar}}):
> {noformat}
> scala> new java.io.File("file:///tmp/file").toURI
> res1: java.net.URI = 
> file:/Users/lian/local/src/spark/workspace-a/file:/tmp/file
> {noformat}
> The consequence is that, although the {{ADD JAR}} command doesn't fail 
> immediately, the jar is actually not added properly.






[jira] [Commented] (SPARK-11590) use native json_tuple in lateral view

2015-11-09 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14996490#comment-14996490
 ] 

Wenchen Fan commented on SPARK-11590:
-

https://github.com/apache/spark/pull/9562

> use native json_tuple in lateral view
> -
>
> Key: SPARK-11590
> URL: https://issues.apache.org/jira/browse/SPARK-11590
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>







[jira] [Issue Comment Deleted] (SPARK-11590) use native json_tuple in lateral view

2015-11-09 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-11590:

Comment: was deleted

(was: https://github.com/apache/spark/pull/9562)

> use native json_tuple in lateral view
> -
>
> Key: SPARK-11590
> URL: https://issues.apache.org/jira/browse/SPARK-11590
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>







[jira] [Assigned] (SPARK-11595) "ADD JAR" doesn't work if the given path contains URL scheme like "file:/" and "hdfs:/"

2015-11-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11595:


Assignee: Cheng Lian  (was: Apache Spark)

> "ADD JAR" doesn't work if the given path contains URL scheme like "file:/" 
> and "hdfs:/"
> ---
>
> Key: SPARK-11595
> URL: https://issues.apache.org/jira/browse/SPARK-11595
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Blocker
>
> When handling {{ADD JAR}}, PR #8909 constructs a {{java.io.File}} first using 
> the input jar path, and then converts it into a URL 
> ([here|https://github.com/apache/spark/pull/8909/files#diff-d613c921507243c65591c003a348f5f3R180]).
>  This works fine for local file paths without a URL scheme (e.g. 
> {{/tmp/a.jar}}). However, {{java.io.File.toURI}} returns an unexpected result 
> when given a path containing a URL scheme (e.g. {{file:///tmp/a.jar}} or 
> {{hdfs://host:9000/path/to/a.jar}}):
> {noformat}
> scala> new java.io.File("file:///tmp/file").toURI
> res1: java.net.URI = 
> file:/Users/lian/local/src/spark/workspace-a/file:/tmp/file
> {noformat}
> The consequence is that, although the {{ADD JAR}} command doesn't fail 
> immediately, the jar is actually not added properly.






[jira] [Commented] (SPARK-11595) "ADD JAR" doesn't work if the given path contains URL scheme like "file:/" and "hdfs:/"

2015-11-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14996527#comment-14996527
 ] 

Apache Spark commented on SPARK-11595:
--

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/9569

> "ADD JAR" doesn't work if the given path contains URL scheme like "file:/" 
> and "hdfs:/"
> ---
>
> Key: SPARK-11595
> URL: https://issues.apache.org/jira/browse/SPARK-11595
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Blocker
>
> When handling {{ADD JAR}}, PR #8909 constructs a {{java.io.File}} first using 
> the input jar path, and then converts it into a URL 
> ([here|https://github.com/apache/spark/pull/8909/files#diff-d613c921507243c65591c003a348f5f3R180]).
>  This works fine for local file paths without a URL scheme (e.g. 
> {{/tmp/a.jar}}). However, {{java.io.File.toURI}} returns an unexpected result 
> when given a path containing a URL scheme (e.g. {{file:///tmp/a.jar}} or 
> {{hdfs://host:9000/path/to/a.jar}}):
> {noformat}
> scala> new java.io.File("file:///tmp/file").toURI
> res1: java.net.URI = 
> file:/Users/lian/local/src/spark/workspace-a/file:/tmp/file
> {noformat}
> The consequence is that, although the {{ADD JAR}} command doesn't fail 
> immediately, the jar is actually not added properly.






[jira] [Updated] (SPARK-11530) Return eigenvalues with PCA model

2015-11-09 Thread Christos Iraklis Tsatsoulis (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christos Iraklis Tsatsoulis updated SPARK-11530:

Component/s: MLlib

> Return eigenvalues with PCA model
> -
>
> Key: SPARK-11530
> URL: https://issues.apache.org/jira/browse/SPARK-11530
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 1.5.1
>Reporter: Christos Iraklis Tsatsoulis
>
> For data scientists & statisticians, PCA is of little use if they cannot 
> estimate the _proportion of variance explained_ by selecting _k_ principal 
> components (see here for the math details: 
> https://inst.eecs.berkeley.edu/~ee127a/book/login/l_sym_pca.html , section 
> 'Explained variance'). To estimate this, one only needs the eigenvalues of 
> the covariance matrix.
> Although the eigenvalues are currently computed during PCA model fitting, 
> they are not _returned_; hence, as it stands now, PCA in Spark ML is of 
> extremely limited practical use.
> For details, see these SO questions
> http://stackoverflow.com/questions/33428589/pyspark-and-pca-how-can-i-extract-the-eigenvectors-of-this-pca-how-can-i-calcu/
>  (pyspark)
> http://stackoverflow.com/questions/33559599/spark-pca-top-components (Scala)
> and this blog post http://www.nodalpoint.com/pca-in-spark-1-5/
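Once the eigenvalues are exposed, computing the explained-variance ratio is trivial. A minimal sketch (not the Spark ML API):

{code}
// Given the eigenvalues of the covariance matrix, the proportion of variance
// explained by the top k principal components is their share of the total.
def explainedVariance(eigenvalues: Array[Double], k: Int): Double = {
  val sorted = eigenvalues.sortBy(-_)   // largest first
  sorted.take(k).sum / sorted.sum
}

// e.g. explainedVariance(Array(4.0, 2.0, 1.0, 1.0), k = 2) == 0.75
{code}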






[jira] [Commented] (SPARK-10309) Some tasks failed with Unable to acquire memory

2015-11-09 Thread Kristina Plazonic (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14996610#comment-14996610
 ] 

Kristina Plazonic commented on SPARK-10309:
---

Did anybody find a solution for this? I also get a lot of these errors (well 
into running my job, arghhh). 

> Some tasks failed with Unable to acquire memory
> ---
>
> Key: SPARK-10309
> URL: https://issues.apache.org/jira/browse/SPARK-10309
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> *=== Update ===*
> This is caused by a mismatch between 
> `Runtime.getRuntime.availableProcessors()` and the number of active tasks in 
> `ShuffleMemoryManager`. A quick reproduction is the following:
> {code}
> // My machine only has 8 cores
> $ bin/spark-shell --master local[32]
> scala> val df = sc.parallelize(Seq((1, 1), (2, 2))).toDF("a", "b")
> scala> df.as("x").join(df.as("y"), $"x.a" === $"y.a").count()
> Caused by: java.io.IOException: Unable to acquire 2097152 bytes of memory
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:351)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.(UnsafeExternalSorter.java:138)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.create(UnsafeExternalSorter.java:106)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.(UnsafeExternalRowSorter.java:68)
>   at 
> org.apache.spark.sql.execution.TungstenSort.org$apache$spark$sql$execution$TungstenSort$$preparePartition$1(sort.scala:120)
>   at 
> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$2.apply(sort.scala:143)
>   at 
> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$2.apply(sort.scala:143)
>   at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.prepare(MapPartitionsWithPreparationRDD.scala:50)
> {code}
> *=== Original ===*
> While running Q53 of TPCDS (scale = 1500) on 24 nodes cluster (12G memory on 
> executor):
> {code}
> java.io.IOException: Unable to acquire 33554432 bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:368)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.(UnsafeExternalSorter.java:138)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.create(UnsafeExternalSorter.java:106)
> at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.(UnsafeExternalRowSorter.java:68)
> at 
> org.apache.spark.sql.execution.TungstenSort.org$apache$spark$sql$execution$TungstenSort$$preparePartition$1(sort.scala:146)
> at 
> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169)
> at 
> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:45)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> The task could finish after a retry.
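To see why such a mismatch bites, here is a back-of-the-envelope sketch; the numbers and formulas are illustrative assumptions only, not the exact page-size and per-task accounting used by Spark 1.5:

{code}
// Illustrative values only (assumed, not Spark's real formulas).
val shuffleMemory = 512L << 20                        // bytes managed for shuffle
val cores = 8                                         // Runtime.getRuntime.availableProcessors()
val activeTasks = 32                                  // e.g. --master local[32]

// If the page size is derived from the number of cores ...
val pageSize = shuffleMemory / (2 * cores)            // 32 MB
// ... but each task is only guaranteed 1 / (2 * numActiveTasks) of the pool ...
val perTaskShare = shuffleMemory / (2 * activeTasks)  // 8 MB
// ... then a task may be unable to acquire even a single page:
assert(perTaskShare < pageSize)                       // "Unable to acquire ... bytes of memory"
{code}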






[jira] [Commented] (SPARK-11191) [1.5] Can't create UDF's using hive thrift service

2015-11-09 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14996519#comment-14996519
 ] 

Cheng Lian commented on SPARK-11191:


One of the problems here is SPARK-11595. However, even after fixing SPARK-11595, 
{{CREATE TEMPORARY FUNCTION}} still doesn't work properly. Still investigating.

> [1.5] Can't create UDF's using hive thrift service
> --
>
> Key: SPARK-11191
> URL: https://issues.apache.org/jira/browse/SPARK-11191
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
>Reporter: David Ross
>Priority: Blocker
>
> Since upgrading to Spark 1.5 we've been unable to create and use UDFs when 
> we run in thrift server mode.
> Our setup:
> We start the thrift-server running against YARN in client mode (we've also 
> built our own Spark from the GitHub branch-1.5 with the following args: {{-Pyarn 
> -Phive -Phive-thriftserver}}).
> If I run the following after connecting via JDBC (in this case via beeline):
> {{add jar "hdfs://path/to/jar"}}
> (this command succeeds with no errors)
> {{CREATE TEMPORARY FUNCTION testUDF AS 'com.foo.class.UDF';}}
> (this command succeeds with no errors)
> {{select testUDF(col1) from table1;}}
> I get the following error in the logs:
> {code}
> org.apache.spark.sql.AnalysisException: undefined function testUDF; line 1 
> pos 8
> at 
> org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2$$anonfun$1.apply(hiveUDFs.scala:58)
> at 
> org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2$$anonfun$1.apply(hiveUDFs.scala:58)
> at scala.Option.getOrElse(Option.scala:120)
> at 
> org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2.apply(hiveUDFs.scala:57)
> at 
> org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2.apply(hiveUDFs.scala:53)
> at scala.util.Try.getOrElse(Try.scala:77)
> at 
> org.apache.spark.sql.hive.HiveFunctionRegistry.lookupFunction(hiveUDFs.scala:53)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5$$anonfun$applyOrElse$24.apply(Analyzer.scala:506)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5$$anonfun$applyOrElse$24.apply(Analyzer.scala:506)
> at 
> org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:505)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:502)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:226)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249)
> {code}
> (cutting the bulk for ease of reporting; more than happy to send the full output)
> {code}
> 15/10/12 14:34:37 ERROR SparkExecuteStatementOperation: Error running hive 
> query:
> org.apache.hive.service.cli.HiveSQLException: 
> org.apache.spark.sql.AnalysisException: undefined function testUDF; line 1 
> pos 100
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.runInternal(SparkExecuteStatementOperation.scala:259)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:171)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:182)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> 

[jira] [Commented] (SPARK-10528) spark-shell throws java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable.

2015-11-09 Thread Michel Lemay (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14996667#comment-14996667
 ] 

Michel Lemay commented on SPARK-10528:
--

Looks like the problem is still there with the precompiled 1.5.1 binaries.

I created the directory and chmod'ed it using winutils.exe as explained earlier.
From /tmp:
{noformat}
winutils.exe ls hive
drwxrwxrwx
{noformat}

Furthermore, Hadoop's LocalFileSystem does not seem to be able to change 
permissions, as shown here:

{code}
import org.apache.hadoop.fs._
val path = new Path("file:/tmp/hive")
val lfs = FileSystem.get(path.toUri(), sc.hadoopConfiguration)
lfs.getFileStatus(path).getPermission()
{code}

shows {{res0: org.apache.hadoop.fs.permission.FsPermission = rw-rw-rw-}}, and

{code}
lfs.setPermission(path, new org.apache.hadoop.fs.permission.FsPermission(0777.toShort))
lfs.getFileStatus(new Path("file:/tmp/hive")).getPermission()
{code}

still shows {{rw-rw-rw-}}.




> spark-shell throws java.lang.RuntimeException: The root scratch dir: 
> /tmp/hive on HDFS should be writable.
> --
>
> Key: SPARK-10528
> URL: https://issues.apache.org/jira/browse/SPARK-10528
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.5.0
> Environment: Windows 7 x64
>Reporter: Aliaksei Belablotski
>Priority: Minor
>
> Starting spark-shell throws
> java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: 
> /tmp/hive on HDFS should be writable. Current permissions are: rw-rw-rw-






[jira] [Resolved] (SPARK-11218) `./sbin/start-slave.sh --help` should print out the help message

2015-11-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-11218.
---
   Resolution: Fixed
Fix Version/s: 1.6.0
   1.7.0

Issue resolved by pull request 9432
[https://github.com/apache/spark/pull/9432]

> `./sbin/start-slave.sh --help` should print out the help message
> 
>
> Key: SPARK-11218
> URL: https://issues.apache.org/jira/browse/SPARK-11218
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Reporter: Jacek Laskowski
>Priority: Minor
> Fix For: 1.7.0, 1.6.0
>
>
> Reading the sources showed that the command {{./sbin/start-slave.sh 
> --help}} should print out the help message. In reality, it doesn't.
> {code}
> ➜  spark git:(master) ✗ ./sbin/start-slave.sh --help
> starting org.apache.spark.deploy.worker.Worker, logging to 
> /Users/jacek/dev/oss/spark/sbin/../logs/spark-jacek-org.apache.spark.deploy.worker.Worker-1-japila.local.out
> failed to launch org.apache.spark.deploy.worker.Worker:
> --properties-file FILE   Path to a custom Spark properties file.
>  Default is conf/spark-defaults.conf.
> full log in 
> /Users/jacek/dev/oss/spark/sbin/../logs/spark-jacek-org.apache.spark.deploy.worker.Worker-1-japila.local.out
> {code}






[jira] [Created] (SPARK-11595) "ADD JAR" doesn't work if the given path contains URL scheme like "file:/" and "hdfs:/"

2015-11-09 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-11595:
--

 Summary: "ADD JAR" doesn't work if the given path contains URL 
scheme like "file:/" and "hdfs:/"
 Key: SPARK-11595
 URL: https://issues.apache.org/jira/browse/SPARK-11595
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.2, 1.6.0
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Blocker


When handling {{ADD JAR}}, PR #8909 constructs a {{java.io.File}} first using 
the input jar path, and then converts it into a URL 
([here|https://github.com/apache/spark/pull/8909/files#diff-d613c921507243c65591c003a348f5f3R180]).
 This works fine for local file paths without a URL scheme (e.g. 
{{/tmp/a.jar}}). However, {{java.io.File.toURI}} returns an unexpected result when 
given a path containing a URL scheme (e.g. {{file:///tmp/a.jar}} or 
{{hdfs://host:9000/path/to/a.jar}}):
{noformat}
scala> new java.io.File("file:///tmp/file").toURI
res1: java.net.URI = file:/Users/lian/local/src/spark/workspace-a/file:/tmp/file
{noformat}
The consequence is that, although the {{ADD JAR}} command doesn't fail 
immediately, the jar is actually not added properly.
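For illustration, the fix needs scheme-aware resolution along these lines: fall back to {{java.io.File}} only when the input carries no URI scheme. This is a minimal sketch of the idea only, with a hypothetical helper name; see the linked PRs for the actual change:

{code}
import java.io.File
import java.net.URI

// Hypothetical helper (not Spark's actual code): keep inputs that already
// have a scheme as-is, and only treat scheme-less inputs as local paths.
def resolveJarUri(path: String): URI = {
  val uri = new URI(path)
  if (uri.getScheme != null) uri   // e.g. file:///tmp/a.jar, hdfs://host:9000/path/to/a.jar
  else new File(path).toURI        // plain local path, e.g. /tmp/a.jar
}

// In contrast, new File("file:///tmp/a.jar").toURI prepends the working
// directory, producing file:/<cwd>/file:/tmp/a.jar, which is exactly the
// misbehaviour shown above.
{code}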






[jira] [Assigned] (SPARK-11594) Cannot create UDAF in REPL

2015-11-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11594:


Assignee: (was: Apache Spark)

> Cannot create UDAF in REPL
> --
>
> Key: SPARK-11594
> URL: https://issues.apache.org/jira/browse/SPARK-11594
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1, 1.6.0
> Environment: Latest Spark Master
> JVM 1.8.0_66-b17
>Reporter: Herman van Hovell
>Priority: Minor
>
> If you try to define a UDAF in the REPL, an internal error is thrown by 
> Java. The following code, for example:
> {noformat}
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types.{DataType, LongType, StructType}
> import org.apache.spark.sql.expressions.{MutableAggregationBuffer, 
> UserDefinedAggregateFunction}
> class LongProductSum extends UserDefinedAggregateFunction {
>   def inputSchema: StructType = new StructType()
> .add("a", LongType)
> .add("b", LongType)
>   def bufferSchema: StructType = new StructType()
> .add("product", LongType)
>   def dataType: DataType = LongType
>   def deterministic: Boolean = true
>   def initialize(buffer: MutableAggregationBuffer): Unit = {
> buffer(0) = 0L
>   }
>   def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
> if (!(input.isNullAt(0) || input.isNullAt(1))) {
>   buffer(0) = buffer.getLong(0) + input.getLong(0) * input.getLong(1)
> }
>   }
>   def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
> buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0)
>   }
>   def evaluate(buffer: Row): Any =
> buffer.getLong(0)
> }
> sqlContext.udf.register("longProductSum", new LongProductSum)
> val data2 = Seq[(Integer, Integer, Integer)](
>   (1, 10, -10),
>   (null, -60, 60),
>   (1, 30, -30),
>   (1, 30, 30),
>   (2, 1, 1),
>   (3, null, null)).toDF("key", "value1", "value2")
> data2.registerTempTable("agg2")
> val q = sqlContext.sql("""
> |SELECT
> |  key,
> |  count(distinct value1, value2),
> |  longProductSum(distinct value1, value2)
> |FROM agg2
> |GROUP BY key
> """.stripMargin)
> q.show
> {noformat}
> Will throw the following error:
> {noformat}
> java.lang.InternalError: Malformed class name
>   at java.lang.Class.getSimpleName(Class.java:1330)
>   at 
> org.apache.spark.sql.execution.aggregate.ScalaUDAF.toString(udaf.scala:455)
>   at 
> org.apache.spark.sql.execution.SparkStrategies$Aggregation$$anonfun$9.apply(SparkStrategies.scala:211)
>   at 
> org.apache.spark.sql.execution.SparkStrategies$Aggregation$$anonfun$9.apply(SparkStrategies.scala:209)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.execution.SparkStrategies$Aggregation$.apply(SparkStrategies.scala:209)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
>   at 
> org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:445)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
>   at 
> org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:51)
>   at 
> org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:49)
>   at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:56)
>   at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:56)
>   at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2092)
>   at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1419)
>   at 

[jira] [Assigned] (SPARK-11594) Cannot create UDAF in REPL

2015-11-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11594:


Assignee: Apache Spark

> Cannot create UDAF in REPL
> --
>
> Key: SPARK-11594
> URL: https://issues.apache.org/jira/browse/SPARK-11594
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1, 1.6.0
> Environment: Latest Spark Master
> JVM 1.8.0_66-b17
>Reporter: Herman van Hovell
>Assignee: Apache Spark
>Priority: Minor
>
> If you try to define a UDAF in the REPL, an internal error is thrown by 
> Java. The following code, for example:
> {noformat}
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types.{DataType, LongType, StructType}
> import org.apache.spark.sql.expressions.{MutableAggregationBuffer, 
> UserDefinedAggregateFunction}
> class LongProductSum extends UserDefinedAggregateFunction {
>   def inputSchema: StructType = new StructType()
> .add("a", LongType)
> .add("b", LongType)
>   def bufferSchema: StructType = new StructType()
> .add("product", LongType)
>   def dataType: DataType = LongType
>   def deterministic: Boolean = true
>   def initialize(buffer: MutableAggregationBuffer): Unit = {
> buffer(0) = 0L
>   }
>   def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
> if (!(input.isNullAt(0) || input.isNullAt(1))) {
>   buffer(0) = buffer.getLong(0) + input.getLong(0) * input.getLong(1)
> }
>   }
>   def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
> buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0)
>   }
>   def evaluate(buffer: Row): Any =
> buffer.getLong(0)
> }
> sqlContext.udf.register("longProductSum", new LongProductSum)
> val data2 = Seq[(Integer, Integer, Integer)](
>   (1, 10, -10),
>   (null, -60, 60),
>   (1, 30, -30),
>   (1, 30, 30),
>   (2, 1, 1),
>   (3, null, null)).toDF("key", "value1", "value2")
> data2.registerTempTable("agg2")
> val q = sqlContext.sql("""
> |SELECT
> |  key,
> |  count(distinct value1, value2),
> |  longProductSum(distinct value1, value2)
> |FROM agg2
> |GROUP BY key
> """.stripMargin)
> q.show
> {noformat}
> Will throw the following error:
> {noformat}
> java.lang.InternalError: Malformed class name
>   at java.lang.Class.getSimpleName(Class.java:1330)
>   at 
> org.apache.spark.sql.execution.aggregate.ScalaUDAF.toString(udaf.scala:455)
>   at 
> org.apache.spark.sql.execution.SparkStrategies$Aggregation$$anonfun$9.apply(SparkStrategies.scala:211)
>   at 
> org.apache.spark.sql.execution.SparkStrategies$Aggregation$$anonfun$9.apply(SparkStrategies.scala:209)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.execution.SparkStrategies$Aggregation$.apply(SparkStrategies.scala:209)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
>   at 
> org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:445)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
>   at 
> org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:51)
>   at 
> org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:49)
>   at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:56)
>   at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:56)
>   at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2092)
>   at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1419)
> 

[jira] [Commented] (SPARK-11594) Cannot create UDAF in REPL

2015-11-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14996518#comment-14996518
 ] 

Apache Spark commented on SPARK-11594:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/9568

> Cannot create UDAF in REPL
> --
>
> Key: SPARK-11594
> URL: https://issues.apache.org/jira/browse/SPARK-11594
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1, 1.6.0
> Environment: Latest Spark Master
> JVM 1.8.0_66-b17
>Reporter: Herman van Hovell
>Priority: Minor
>
> If you try to define a UDAF in the REPL, an internal error is thrown by 
> Java. The following code, for example:
> {noformat}
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types.{DataType, LongType, StructType}
> import org.apache.spark.sql.expressions.{MutableAggregationBuffer, 
> UserDefinedAggregateFunction}
> class LongProductSum extends UserDefinedAggregateFunction {
>   def inputSchema: StructType = new StructType()
> .add("a", LongType)
> .add("b", LongType)
>   def bufferSchema: StructType = new StructType()
> .add("product", LongType)
>   def dataType: DataType = LongType
>   def deterministic: Boolean = true
>   def initialize(buffer: MutableAggregationBuffer): Unit = {
> buffer(0) = 0L
>   }
>   def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
> if (!(input.isNullAt(0) || input.isNullAt(1))) {
>   buffer(0) = buffer.getLong(0) + input.getLong(0) * input.getLong(1)
> }
>   }
>   def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
> buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0)
>   }
>   def evaluate(buffer: Row): Any =
> buffer.getLong(0)
> }
> sqlContext.udf.register("longProductSum", new LongProductSum)
> val data2 = Seq[(Integer, Integer, Integer)](
>   (1, 10, -10),
>   (null, -60, 60),
>   (1, 30, -30),
>   (1, 30, 30),
>   (2, 1, 1),
>   (3, null, null)).toDF("key", "value1", "value2")
> data2.registerTempTable("agg2")
> val q = sqlContext.sql("""
> |SELECT
> |  key,
> |  count(distinct value1, value2),
> |  longProductSum(distinct value1, value2)
> |FROM agg2
> |GROUP BY key
> """.stripMargin)
> q.show
> {noformat}
> Will throw the following error:
> {noformat}
> java.lang.InternalError: Malformed class name
>   at java.lang.Class.getSimpleName(Class.java:1330)
>   at 
> org.apache.spark.sql.execution.aggregate.ScalaUDAF.toString(udaf.scala:455)
>   at 
> org.apache.spark.sql.execution.SparkStrategies$Aggregation$$anonfun$9.apply(SparkStrategies.scala:211)
>   at 
> org.apache.spark.sql.execution.SparkStrategies$Aggregation$$anonfun$9.apply(SparkStrategies.scala:209)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.execution.SparkStrategies$Aggregation$.apply(SparkStrategies.scala:209)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
>   at 
> org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:445)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
>   at 
> org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:51)
>   at 
> org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:49)
>   at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:56)
>   at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:56)
>   at 

[jira] [Comment Edited] (SPARK-11530) Return eigenvalues with PCA model

2015-11-09 Thread Christos Iraklis Tsatsoulis (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14996533#comment-14996533
 ] 

Christos Iraklis Tsatsoulis edited comment on SPARK-11530 at 11/9/15 1:40 PM:
--

I edited it to target both; there are `PCA.scala` scripts for both ML & MLLib, 
but since I am using it via PySpark, where it is available only via ML, I 
initially omitted MLlib


was (Author: ctsats):
I edited it to target both; there are PCA.scala scripts for both 
ML & MLLib, but since I am using it via PySpark, where it is available only via 
ML, I initially omitted MLlib

> Return eigenvalues with PCA model
> -
>
> Key: SPARK-11530
> URL: https://issues.apache.org/jira/browse/SPARK-11530
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 1.5.1
>Reporter: Christos Iraklis Tsatsoulis
>
> For data scientists & statisticians, PCA is of little use if they cannot 
> estimate the _proportion of variance explained_ by selecting _k_ principal 
> components (see here for the math details: 
> https://inst.eecs.berkeley.edu/~ee127a/book/login/l_sym_pca.html , section 
> 'Explained variance'). To estimate this, one only needs the eigenvalues of 
> the covariance matrix.
> Although the eigenvalues are currently computed during PCA model fitting, 
> they are not _returned_; hence, as it stands now, PCA in Spark ML is of 
> extremely limited practical use.
> For details, see these SO questions
> http://stackoverflow.com/questions/33428589/pyspark-and-pca-how-can-i-extract-the-eigenvectors-of-this-pca-how-can-i-calcu/
>  (pyspark)
> http://stackoverflow.com/questions/33559599/spark-pca-top-components (Scala)
> and this blog post http://www.nodalpoint.com/pca-in-spark-1-5/






[jira] [Updated] (SPARK-11218) `./sbin/start-slave.sh --help` should print out the help message

2015-11-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11218:
--
Assignee: Charles Yeh

> `./sbin/start-slave.sh --help` should print out the help message
> 
>
> Key: SPARK-11218
> URL: https://issues.apache.org/jira/browse/SPARK-11218
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Reporter: Jacek Laskowski
>Assignee: Charles Yeh
>Priority: Minor
> Fix For: 1.6.0, 1.7.0
>
>
> Reading the sources showed that the command {{./sbin/start-slave.sh 
> --help}} should print out the help message. In reality, it doesn't.
> {code}
> ➜  spark git:(master) ✗ ./sbin/start-slave.sh --help
> starting org.apache.spark.deploy.worker.Worker, logging to 
> /Users/jacek/dev/oss/spark/sbin/../logs/spark-jacek-org.apache.spark.deploy.worker.Worker-1-japila.local.out
> failed to launch org.apache.spark.deploy.worker.Worker:
> --properties-file FILE   Path to a custom Spark properties file.
>  Default is conf/spark-defaults.conf.
> full log in 
> /Users/jacek/dev/oss/spark/sbin/../logs/spark-jacek-org.apache.spark.deploy.worker.Worker-1-japila.local.out
> {code}






[jira] [Comment Edited] (SPARK-11530) Return eigenvalues with PCA model

2015-11-09 Thread Christos Iraklis Tsatsoulis (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14996533#comment-14996533
 ] 

Christos Iraklis Tsatsoulis edited comment on SPARK-11530 at 11/9/15 1:45 PM:
--

I edited it to target both; there are PCA.scala scripts for both ML & MLLib, 
but since I am using it via PySpark, where it is available only via ML, I 
initially omitted MLlib


was (Author: ctsats):
I edited it to target both; there are `PCA.scala` scripts for both ML & MLLib, 
but since I am using it via PySpark, where it is available only via ML, I 
initially omitted MLlib

> Return eigenvalues with PCA model
> -
>
> Key: SPARK-11530
> URL: https://issues.apache.org/jira/browse/SPARK-11530
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 1.5.1
>Reporter: Christos Iraklis Tsatsoulis
>
> For data scientists & statisticians, PCA is of little use if they cannot 
> estimate the _proportion of variance explained_ by selecting _k_ principal 
> components (see here for the math details: 
> https://inst.eecs.berkeley.edu/~ee127a/book/login/l_sym_pca.html , section 
> 'Explained variance'). To estimate this, one only needs the eigenvalues of 
> the covariance matrix.
> Although the eigenvalues are currently computed during PCA model fitting, 
> they are not _returned_; hence, as it stands now, PCA in Spark ML is of 
> extremely limited practical use.
> For details, see these SO questions
> http://stackoverflow.com/questions/33428589/pyspark-and-pca-how-can-i-extract-the-eigenvectors-of-this-pca-how-can-i-calcu/
>  (pyspark)
> http://stackoverflow.com/questions/33559599/spark-pca-top-components (Scala)
> and this blog post http://www.nodalpoint.com/pca-in-spark-1-5/






[jira] [Comment Edited] (SPARK-11530) Return eigenvalues with PCA model

2015-11-09 Thread Christos Iraklis Tsatsoulis (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14996533#comment-14996533
 ] 

Christos Iraklis Tsatsoulis edited comment on SPARK-11530 at 11/9/15 1:50 PM:
--

I edited it to target both; there are PCA.scala scripts for both ML & MLlib, 
but since I am using it via PySpark, where it is available only via ML, I 
initially omitted MLlib.


was (Author: ctsats):
I edited it to target both; there are PCA.scala scripts for both ML & MLLib, 
but since I am using it via PySpark, where it is available only via ML, I 
initially omitted MLlib

> Return eigenvalues with PCA model
> -
>
> Key: SPARK-11530
> URL: https://issues.apache.org/jira/browse/SPARK-11530
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 1.5.1
>Reporter: Christos Iraklis Tsatsoulis
>
> For data scientists & statisticians, PCA is of little use if they cannot 
> estimate the _proportion of variance explained_ by selecting _k_ principal 
> components (see here for the math details: 
> https://inst.eecs.berkeley.edu/~ee127a/book/login/l_sym_pca.html , section 
> 'Explained variance'). To estimate this, one only needs the eigenvalues of 
> the covariance matrix.
> Although the eigenvalues are currently computed during PCA model fitting, 
> they are not _returned_; hence, as it stands now, PCA in Spark ML is of 
> extremely limited practical use.
> For details, see these SO questions
> http://stackoverflow.com/questions/33428589/pyspark-and-pca-how-can-i-extract-the-eigenvectors-of-this-pca-how-can-i-calcu/
>  (pyspark)
> http://stackoverflow.com/questions/33559599/spark-pca-top-components (Scala)
> and this blog post http://www.nodalpoint.com/pca-in-spark-1-5/






[jira] [Updated] (SPARK-11595) "ADD JAR" doesn't work if the given path contains URL scheme like "file:/" and "hdfs:/"

2015-11-09 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-11595:
---
Description: 
When handling {{ADD JAR}}, Spark constructs a {{java.io.File}} first using the 
input jar path, and then converts it into a URL 
([here|https://github.com/apache/spark/pull/8909/files#diff-d613c921507243c65591c003a348f5f3R180]).
 This works fine for local file paths without a URL scheme (e.g. 
{{/tmp/a.jar}}). However, {{java.io.File.toURI}} returns an unexpected result when 
given a path containing a URL scheme (e.g. {{file:///tmp/a.jar}} or 
{{hdfs://host:9000/path/to/a.jar}}):
{noformat}
scala> new java.io.File("file:///tmp/file").toURI
res1: java.net.URI = file:/Users/lian/local/src/spark/workspace-a/file:/tmp/file
{noformat}
The consequence is that, although the {{ADD JAR}} command doesn't fail 
immediately, the jar is actually not added properly.

  was:
When handling {{ADD JAR}}, PR #8909 constructs a {{java.io.File}} first using 
the input jar path, and then converts it into a URL 
([here|https://github.com/apache/spark/pull/8909/files#diff-d613c921507243c65591c003a348f5f3R180]).
 This works fine for local file paths without a URL scheme (e.g. 
{{/tmp/a.jar}}). However, {{java.io.File.toURI}} returns an unexpected result when 
given a path containing a URL scheme (e.g. {{file:///tmp/a.jar}} or 
{{hdfs://host:9000/path/to/a.jar}}):
{noformat}
scala> new java.io.File("file:///tmp/file").toURI
res1: java.net.URI = file:/Users/lian/local/src/spark/workspace-a/file:/tmp/file
{noformat}
The consequence is that, although the {{ADD JAR}} command doesn't fail 
immediately, the jar is actually not added properly.


> "ADD JAR" doesn't work if the given path contains URL scheme like "file:/" 
> and "hdfs:/"
> ---
>
> Key: SPARK-11595
> URL: https://issues.apache.org/jira/browse/SPARK-11595
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Blocker
>
> When handling {{ADD JAR}}, Spark constructs a {{java.io.File}} first using 
> the input jar path, and then converts it into a URL 
> ([here|https://github.com/apache/spark/pull/8909/files#diff-d613c921507243c65591c003a348f5f3R180]).
>  This works fine for local file paths without a URL scheme (e.g. 
> {{/tmp/a.jar}}). However, {{java.io.File.toURI}} returns an unexpected result 
> when given a path containing a URL scheme (e.g. {{file:///tmp/a.jar}} or 
> {{hdfs://host:9000/path/to/a.jar}}):
> {noformat}
> scala> new java.io.File("file:///tmp/file").toURI
> res1: java.net.URI = 
> file:/Users/lian/local/src/spark/workspace-a/file:/tmp/file
> {noformat}
> The consequence is that, although the {{ADD JAR}} command doesn't fail 
> immediately, the jar is actually not added properly.






[jira] [Commented] (SPARK-11530) Return eigenvalues with PCA model

2015-11-09 Thread Christos Iraklis Tsatsoulis (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14996533#comment-14996533
 ] 

Christos Iraklis Tsatsoulis commented on SPARK-11530:
-

I edited it to target both; there are ``PCA.scala`` scripts for both ML & 
MLLib, but since I am using it via PySpark, where it is available only via ML, 
I initially omitted MLlib

> Return eigenvalues with PCA model
> -
>
> Key: SPARK-11530
> URL: https://issues.apache.org/jira/browse/SPARK-11530
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 1.5.1
>Reporter: Christos Iraklis Tsatsoulis
>
> For data scientists & statisticians, PCA is of little use if they cannot 
> estimate the _proportion of variance explained_ by selecting _k_ principal 
> components (see here for the math details: 
> https://inst.eecs.berkeley.edu/~ee127a/book/login/l_sym_pca.html , section 
> 'Explained variance'). To estimate this, one only needs the eigenvalues of 
> the covariance matrix.
> Although the eigenvalues are currently computed during PCA model fitting, 
> they are not _returned_; hence, as it stands now, PCA in Spark ML is of 
> extremely limited practical use.
> For details, see these SO questions
> http://stackoverflow.com/questions/33428589/pyspark-and-pca-how-can-i-extract-the-eigenvectors-of-this-pca-how-can-i-calcu/
>  (pyspark)
> http://stackoverflow.com/questions/33559599/spark-pca-top-components (Scala)
> and this blog post http://www.nodalpoint.com/pca-in-spark-1-5/






[jira] [Comment Edited] (SPARK-11530) Return eigenvalues with PCA model

2015-11-09 Thread Christos Iraklis Tsatsoulis (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14996533#comment-14996533
 ] 

Christos Iraklis Tsatsoulis edited comment on SPARK-11530 at 11/9/15 1:37 PM:
--

I edited it to target both; there are PCA.scala scripts for both 
ML & MLLib, but since I am using it via PySpark, where it is available only via 
ML, I initially omitted MLlib


was (Author: ctsats):
I edited it to target both; there are ``PCA.scala`` scripts for both ML & 
MLLib, but since I am using it via PySpark, where it is available only via ML, 
I initially omitted MLlib

> Return eigenvalues with PCA model
> -
>
> Key: SPARK-11530
> URL: https://issues.apache.org/jira/browse/SPARK-11530
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 1.5.1
>Reporter: Christos Iraklis Tsatsoulis
>
> For data scientists & statisticians, PCA is of little use if they cannot 
> estimate the _proportion of variance explained_ by selecting _k_ principal 
> components (see here for the math details: 
> https://inst.eecs.berkeley.edu/~ee127a/book/login/l_sym_pca.html , section 
> 'Explained variance'). To estimate this, one only needs the eigenvalues of 
> the covariance matrix.
> Although the eigenvalues are currently computed during PCA model fitting, 
> they are not _returned_; hence, as it stands now, PCA in Spark ML is of 
> extremely limited practical use.
> For details, see these SO questions
> http://stackoverflow.com/questions/33428589/pyspark-and-pca-how-can-i-extract-the-eigenvectors-of-this-pca-how-can-i-calcu/
>  (pyspark)
> http://stackoverflow.com/questions/33559599/spark-pca-top-components (Scala)
> and this blog post http://www.nodalpoint.com/pca-in-spark-1-5/






[jira] [Resolved] (SPARK-10280) Add @since annotation to pyspark.ml.classification

2015-11-09 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-10280.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8690
[https://github.com/apache/spark/pull/8690]

> Add @since annotation to pyspark.ml.classification
> --
>
> Key: SPARK-10280
> URL: https://issues.apache.org/jira/browse/SPARK-10280
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, PySpark
>Reporter: Xiangrui Meng
>Assignee: Yu Ishikawa
>Priority: Minor
>  Labels: starter
> Fix For: 1.6.0
>
>







[jira] [Updated] (SPARK-9297) covar_pop and covar_samp aggregate functions

2015-11-09 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-9297:

Target Version/s:   (was: 1.6.0)

> covar_pop and covar_samp aggregate functions
> 
>
> Key: SPARK-9297
> URL: https://issues.apache.org/jira/browse/SPARK-9297
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>
> A short introduction on how to build aggregate functions based on our new 
> interface can be found at 
> https://issues.apache.org/jira/browse/SPARK-4366?focusedCommentId=14639921=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14639921.






[jira] [Assigned] (SPARK-7841) Spark build should not use lib_managed for dependencies

2015-11-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7841:
---

Assignee: Apache Spark

> Spark build should not use lib_managed for dependencies
> ---
>
> Key: SPARK-7841
> URL: https://issues.apache.org/jira/browse/SPARK-7841
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.3.1
>Reporter: Iulian Dragos
>Assignee: Apache Spark
>  Labels: easyfix, sbt
>
> - unnecessary duplication (I will have those libraries under ~/.m2, via Maven 
> anyway)
> - every time I call make-distribution I lose lib_managed (via mvn clean 
> install) and have to wait to download all the jars again the next time I use sbt
> - Eclipse does not handle relative paths very well (source attachments from 
> lib_managed don’t always work)
> - it's not the default configuration. If we stray from the defaults, I think there 
> should be a clear advantage.
> Digging through history, the only reference to `retrieveManaged := true` I 
> found was in f686e3d, from July 2011 ("Initial work on converting build to 
> SBT 0.10.1"). My guess is that this is purely an accident of porting the build from 
> sbt 0.7.x and trying to keep the old project layout (see the sbt fragment sketched below).
> If there are reasons for keeping it, please comment (I didn't get any answers 
> on the [dev mailing 
> list|http://apache-spark-developers-list.1001551.n3.nabble.com/Why-use-quot-lib-managed-quot-for-the-Sbt-build-td12361.html])
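For context, the setting in question is a one-line sbt configuration; a minimal 
sketch (assuming a plain build.sbt, not the actual Spark build files) of what 
dropping it means:

{code}
// With retrieveManaged left at its default (false), sbt resolves jars straight
// from the local Ivy/Maven caches instead of copying them into lib_managed/.
retrieveManaged := false
{code}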



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7841) Spark build should not use lib_managed for dependencies

2015-11-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7841:
---

Assignee: (was: Apache Spark)

> Spark build should not use lib_managed for dependencies
> ---
>
> Key: SPARK-7841
> URL: https://issues.apache.org/jira/browse/SPARK-7841
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.3.1
>Reporter: Iulian Dragos
>  Labels: easyfix, sbt
>
> - unnecessary duplication (I will have those libraries under ~/.m2, via Maven 
> anyway)
> - every time I call make-distribution I lose lib_managed (via mvn clean 
> install) and have to wait to download all the jars again the next time I use sbt
> - Eclipse does not handle relative paths very well (source attachments from 
> lib_managed don’t always work)
> - it's not the default configuration. If we stray from the defaults, I think there 
> should be a clear advantage.
> Digging through history, the only reference to `retrieveManaged := true` I 
> found was in f686e3d, from July 2011 ("Initial work on converting build to 
> SBT 0.10.1"). My guess is that this is purely an accident of porting the build from 
> sbt 0.7.x and trying to keep the old project layout.
> If there are reasons for keeping it, please comment (I didn't get any answers 
> on the [dev mailing 
> list|http://apache-spark-developers-list.1001551.n3.nabble.com/Why-use-quot-lib-managed-quot-for-the-Sbt-build-td12361.html])



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7841) Spark build should not use lib_managed for dependencies

2015-11-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997489#comment-14997489
 ] 

Apache Spark commented on SPARK-7841:
-

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/9575

> Spark build should not use lib_managed for dependencies
> ---
>
> Key: SPARK-7841
> URL: https://issues.apache.org/jira/browse/SPARK-7841
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.3.1
>Reporter: Iulian Dragos
>  Labels: easyfix, sbt
>
> - unnecessary duplication (I will have those libraries under ~/.m2, via Maven 
> anyway)
> - every time I call make-distribution I lose lib_managed (via mvn clean 
> install) and have to wait to download all the jars again the next time I use sbt
> - Eclipse does not handle relative paths very well (source attachments from 
> lib_managed don’t always work)
> - it's not the default configuration. If we stray from the defaults, I think there 
> should be a clear advantage.
> Digging through history, the only reference to `retrieveManaged := true` I 
> found was in f686e3d, from July 2011 ("Initial work on converting build to 
> SBT 0.10.1"). My guess is that this is purely an accident of porting the build from 
> sbt 0.7.x and trying to keep the old project layout.
> If there are reasons for keeping it, please comment (I didn't get any answers 
> on the [dev mailing 
> list|http://apache-spark-developers-list.1001551.n3.nabble.com/Why-use-quot-lib-managed-quot-for-the-Sbt-build-td12361.html])



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11599) NPE when resolve a non-existent function

2015-11-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11599:


Assignee: (was: Apache Spark)

> NPE when resolve a non-existent function
> 
>
> Key: SPARK-11599
> URL: https://issues.apache.org/jira/browse/SPARK-11599
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.0
>Reporter: Davies Liu
>
> {code}
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.exec.Registry.getFunctionInfo(Registry.java:254)
>   at 
> org.apache.hadoop.hive.ql.exec.FunctionRegistry.getFunctionInfo(FunctionRegistry.java:466)
>   at 
> org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2.apply(hiveUDFs.scala:59)
>   at 
> org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2.apply(hiveUDFs.scala:55)
>   at scala.util.Try.getOrElse(Try.scala:77)
>   at 
> org.apache.spark.sql.hive.HiveFunctionRegistry.lookupFunction(hiveUDFs.scala:55)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$11$$anonfun$applyOrElse$5$$anonfun$applyOrElse$21.apply(Analyzer.scala:527)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$11$$anonfun$applyOrElse$5$$anonfun$applyOrElse$21.apply(Analyzer.scala:527)
>   at 
> org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$11$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:526)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$11$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:523)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:228)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:228)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:227)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:233)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:233)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:250)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:280)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:233)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionDown$1(QueryPlan.scala:75)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:85)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:89)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:89)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:93)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> 

[jira] [Assigned] (SPARK-11599) NPE when resolve a non-existent function

2015-11-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11599:


Assignee: Apache Spark

> NPE when resolve a non-existent function
> 
>
> Key: SPARK-11599
> URL: https://issues.apache.org/jira/browse/SPARK-11599
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.0
>Reporter: Davies Liu
>Assignee: Apache Spark
>
> {code}
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.exec.Registry.getFunctionInfo(Registry.java:254)
>   at 
> org.apache.hadoop.hive.ql.exec.FunctionRegistry.getFunctionInfo(FunctionRegistry.java:466)
>   at 
> org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2.apply(hiveUDFs.scala:59)
>   at 
> org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2.apply(hiveUDFs.scala:55)
>   at scala.util.Try.getOrElse(Try.scala:77)
>   at 
> org.apache.spark.sql.hive.HiveFunctionRegistry.lookupFunction(hiveUDFs.scala:55)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$11$$anonfun$applyOrElse$5$$anonfun$applyOrElse$21.apply(Analyzer.scala:527)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$11$$anonfun$applyOrElse$5$$anonfun$applyOrElse$21.apply(Analyzer.scala:527)
>   at 
> org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$11$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:526)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$11$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:523)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:228)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:228)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:227)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:233)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:233)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:250)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:280)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:233)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionDown$1(QueryPlan.scala:75)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:85)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:89)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:89)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:93)
>   at 

[jira] [Updated] (SPARK-11574) Spark should support StatsD sink out of box

2015-11-09 Thread Xiaofeng Lin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaofeng Lin updated SPARK-11574:
-
Affects Version/s: (was: 1.5.2)
   1.5.1

> Spark should support StatsD sink out of box
> ---
>
> Key: SPARK-11574
> URL: https://issues.apache.org/jira/browse/SPARK-11574
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.5.1
>Reporter: Xiaofeng Lin
>
> In order to run Spark in production, monitoring is essential. StatsD is such 
> a common metric-reporting mechanism that it should be supported out of the 
> box.  This will enable publishing metrics to monitoring services like 
> Datadog, etc. 
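As a sketch of what this could look like (a hypothetical configuration, modeled 
on how the bundled Graphite sink is enabled today; the StatsD class name and 
keys below are assumptions, not an existing Spark API):

{code}
# conf/metrics.properties
*.sink.statsd.class=org.apache.spark.metrics.sink.StatsdSink
*.sink.statsd.host=127.0.0.1
*.sink.statsd.port=8125
{code}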



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9301) collect_set and collect_list aggregate functions

2015-11-09 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-9301:

Target Version/s:   (was: 1.6.0)

> collect_set and collect_list aggregate functions
> 
>
> Key: SPARK-9301
> URL: https://issues.apache.org/jira/browse/SPARK-9301
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Priority: Critical
>
> A short introduction on how to build aggregate functions based on our new 
> interface can be found at 
> https://issues.apache.org/jira/browse/SPARK-4366?focusedCommentId=14639921=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14639921.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9300) histogram_numeric aggregate function

2015-11-09 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-9300:

Target Version/s:   (was: 1.6.0)

> histogram_numeric aggregate function
> 
>
> Key: SPARK-9300
> URL: https://issues.apache.org/jira/browse/SPARK-9300
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>
> A short introduction on how to build aggregate functions based on our new 
> interface can be found at 
> https://issues.apache.org/jira/browse/SPARK-4366?focusedCommentId=14639921=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14639921.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9299) percentile and percentile_approx aggregate functions

2015-11-09 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-9299:

Target Version/s:   (was: 1.6.0)

> percentile and percentile_approx aggregate functions
> 
>
> Key: SPARK-9299
> URL: https://issues.apache.org/jira/browse/SPARK-9299
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>
> A short introduction on how to build aggregate functions based on our new 
> interface can be found at 
> https://issues.apache.org/jira/browse/SPARK-4366?focusedCommentId=14639921=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14639921.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11599) NPE when resolve a non-existent function

2015-11-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997501#comment-14997501
 ] 

Apache Spark commented on SPARK-11599:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/9576

> NPE when resolve a non-existent function
> 
>
> Key: SPARK-11599
> URL: https://issues.apache.org/jira/browse/SPARK-11599
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.0
>Reporter: Davies Liu
>
> {code}
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.exec.Registry.getFunctionInfo(Registry.java:254)
>   at 
> org.apache.hadoop.hive.ql.exec.FunctionRegistry.getFunctionInfo(FunctionRegistry.java:466)
>   at 
> org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2.apply(hiveUDFs.scala:59)
>   at 
> org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2.apply(hiveUDFs.scala:55)
>   at scala.util.Try.getOrElse(Try.scala:77)
>   at 
> org.apache.spark.sql.hive.HiveFunctionRegistry.lookupFunction(hiveUDFs.scala:55)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$11$$anonfun$applyOrElse$5$$anonfun$applyOrElse$21.apply(Analyzer.scala:527)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$11$$anonfun$applyOrElse$5$$anonfun$applyOrElse$21.apply(Analyzer.scala:527)
>   at 
> org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$11$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:526)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$11$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:523)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:228)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:228)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:227)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:233)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:233)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:250)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:280)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:233)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionDown$1(QueryPlan.scala:75)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:85)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:89)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:89)
>   at 
> 

[jira] [Updated] (SPARK-11587) SparkR can not use summary.glm from base R

2015-11-09 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-11587:
--
Priority: Critical  (was: Major)

> SparkR can not use summary.glm from base R
> --
>
> Key: SPARK-11587
> URL: https://issues.apache.org/jira/browse/SPARK-11587
> Project: Spark
>  Issue Type: Bug
>  Components: ML, R, SparkR
>Reporter: Yanbo Liang
>Priority: Critical
>
> When we use the summary method of base R (not the method of SparkR) in the SparkR console:
> {code}
> model <- glm(Sepal.Width ~ Sepal.Length + Species, data = iris)
> summary(model)
> {code}
> It returns
> {code}
> Error in (function (classes, fdef, mtable)  : 
>   unable to find an inherited method for function ‘summary’ for signature 
> ‘"glm”’
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11326) Support for authentication and encryption in standalone mode

2015-11-09 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997448#comment-14997448
 ] 

Patrick Wendell commented on SPARK-11326:
-

There are a few related conversations here:

1. The feature set and goals of the standalone scheduler. The main goal of that 
scheduler is to make it easy for people to download and run Spark with minimal 
extra dependencies. The main difference between the standalone mode and other 
schedulers is that we aren't providing support for scheduling frameworks other 
than Spark (and likely never will). Beyond that, features are added on a 
case-by-case basis depending on whether there is sufficient commitment from the 
maintainers to support the feature long term.

2. Security in non-YARN modes. I would actually like to see better support for 
security in other modes of Spark, mainly to support the large number of users 
who are not inside Hadoop deployments. BTW, I think the existing 
security architecture of Spark makes this possible, because the concern of 
distributing a shared secret is largely decoupled from the specific security 
mechanism. But we haven't really exposed public hooks for injecting secrets. 
There is also the question of secure job submission, which is addressed in this 
JIRA. This needs some thought and probably makes sense to discuss in the Spark 
1.7 timeframe.

Overall I think some broader questions need to be answered, and it's something 
we can perhaps discuss once 1.6 is out the door as we think about 1.7.

> Support for authentication and encryption in standalone mode
> 
>
> Key: SPARK-11326
> URL: https://issues.apache.org/jira/browse/SPARK-11326
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Jacek Lewandowski
>
> h3.The idea
> Currently, in standalone mode, all components need to use the same secure token 
> for all network connections if they want to have any security ensured. 
> This ticket is intended to split the communication in standalone mode to make 
> it more like YARN mode: application-internal communication and scheduler 
> communication.
> Such refactoring will allow the scheduler (master, workers) to use a 
> distinct secret, which will remain unknown to the users. Similarly, it will 
> allow for better security in applications, because each application will be 
> able to use a distinct secret as well. 
> By providing SASL authentication/encryption for connections between a client 
> (Client or AppClient) and the Spark Master, it becomes possible to introduce 
> pluggable authentication for the standalone deployment mode.
> h3.Improvements introduced by this patch
> This patch introduces the following changes:
> * Spark driver or submission client do not have to use the same secret as 
> workers use to communicate with Master
> * Master is able to authenticate individual clients with the following rules:
> ** When connecting to the master, the client needs to specify 
> {{spark.authenticate.secret}} which is an authentication token for the user 
> specified by {{spark.authenticate.user}} ({{sparkSaslUser}} by default)
> ** Master configuration may include additional 
> {{spark.authenticate.secrets.}} entries specifying the 
> authentication token for particular users, or 
> {{spark.authenticate.authenticatorClass}}, which specifies an implementation of 
> an external credentials provider (able to retrieve the authentication 
> token for a given user).
> ** Workers authenticate with Master as default user {{sparkSaslUser}}. 
> * The authorization rules are as follows:
> ** A regular user is able to manage only his own application (the application 
> which he submitted)
> ** A regular user is not able to register or manage workers
> ** Spark default user {{sparkSaslUser}} can manage all the applications
> h3.User facing changes when running application
> h4.General principles:
> - conf: {{spark.authenticate.secret}} is *never sent* over the wire
> - env: {{SPARK_AUTH_SECRET}} is *never sent* over the wire
> - In all situations the env variable will override the conf variable if present. 
> - In all situations when a user has to pass a secret, it is better (safer) to 
> do this through the env variable
> - In work modes with multiple secrets we assume encrypted communication 
> between client and master, between driver and master, between master and 
> workers
> 
> h4.Work modes and descriptions
> h5.Client mode, single secret
> h6.Configuration
> - env: {{SPARK_AUTH_SECRET=secret}} or conf: 
> {{spark.authenticate.secret=secret}}
> h6.Description
> - The driver is running locally
> - The driver will neither send env: {{SPARK_AUTH_SECRET}} nor conf: 
> {{spark.authenticate.secret}}
> - The driver will use either env: {{SPARK_AUTH_SECRET}} or conf: 
> 

[jira] [Resolved] (SPARK-11508) Add Python API for repartition and sortWithinPartitions

2015-11-09 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-11508.
--
   Resolution: Fixed
 Assignee: Nong Li
Fix Version/s: 1.6.0

It has been resolved by https://github.com/apache/spark/pull/9504.

> Add Python API for repartition and sortWithinPartitions
> ---
>
> Key: SPARK-11508
> URL: https://issues.apache.org/jira/browse/SPARK-11508
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Nong Li
> Fix For: 1.6.0
>
>
> We added a few new methods in 1.6 that are still missing in Python:
> {code}
> def repartition(numPartitions: Int, partitionExprs: Column*): DataFrame
> def repartition(partitionExprs: Column*): DataFrame
> def sortWithinPartitions(sortExprs: Column*): DataFrame
> def sortWithinPartitions(sortCol: String, sortCols: String*): DataFrame
> {code}
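For reference, a small Scala usage sketch of the methods listed above (assuming 
an existing DataFrame {{df}} with columns {{key}} and {{ts}}); this JIRA is 
about exposing the same operations in the Python API:

{code}
import org.apache.spark.sql.functions.col

// Hash-partition the data by "key" into 8 partitions, then sort rows only
// within each partition (no global sort, hence no extra shuffle).
val repartitioned = df.repartition(8, col("key"))
val sortedWithin = repartitioned.sortWithinPartitions(col("ts"))
{code}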



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11574) Spark should support StatsD sink out of box

2015-11-09 Thread Xiaofeng Lin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaofeng Lin updated SPARK-11574:
-
Affects Version/s: 1.6.0

> Spark should support StatsD sink out of box
> ---
>
> Key: SPARK-11574
> URL: https://issues.apache.org/jira/browse/SPARK-11574
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.5.1, 1.6.0
>Reporter: Xiaofeng Lin
>
> In order to run Spark in production, monitoring is essential. StatsD is such 
> a common metric-reporting mechanism that it should be supported out of the 
> box.  This will enable publishing metrics to monitoring services like 
> Datadog, etc. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11552) Replace example code in ml-decision-tree.md using include_example

2015-11-09 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-11552.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9539
[https://github.com/apache/spark/pull/9539]

> Replace example code in ml-decision-tree.md using include_example
> -
>
> Key: SPARK-11552
> URL: https://issues.apache.org/jira/browse/SPARK-11552
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Xusen Yin
>  Labels: starter
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11552) Replace example code in ml-decision-tree.md using include_example

2015-11-09 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-11552:
--
Target Version/s: 1.6.0
Priority: Minor  (was: Major)
 Component/s: ML

> Replace example code in ml-decision-tree.md using include_example
> -
>
> Key: SPARK-11552
> URL: https://issues.apache.org/jira/browse/SPARK-11552
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML
>Reporter: Xusen Yin
>Assignee: sachin aggarwal
>Priority: Minor
>  Labels: starter
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11462) Add JavaStreamingListener

2015-11-09 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-11462.
---
   Resolution: Fixed
 Assignee: Shixiong Zhu
Fix Version/s: 1.6.0

> Add JavaStreamingListener
> -
>
> Key: SPARK-11462
> URL: https://issues.apache.org/jira/browse/SPARK-11462
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 1.6.0
>
>
> Add a Java-friendly API for StreamingListener



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11620) parquet.hadoop.ParquetOutputCommitter.commitJob() throws parquet.io.ParquetEncodingException

2015-11-09 Thread swetha k (JIRA)
swetha k created SPARK-11620:


 Summary: parquet.hadoop.ParquetOutputCommitter.commitJob() throws 
parquet.io.ParquetEncodingException
 Key: SPARK-11620
 URL: https://issues.apache.org/jira/browse/SPARK-11620
 Project: Spark
  Issue Type: Bug
Reporter: swetha k






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11587) SparkR can not use summary.glm from base R

2015-11-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997842#comment-14997842
 ] 

Apache Spark commented on SPARK-11587:
--

User 'shivaram' has created a pull request for this issue:
https://github.com/apache/spark/pull/9582

> SparkR can not use summary.glm from base R
> --
>
> Key: SPARK-11587
> URL: https://issues.apache.org/jira/browse/SPARK-11587
> Project: Spark
>  Issue Type: Bug
>  Components: ML, R, SparkR
>Reporter: Yanbo Liang
>Priority: Critical
>
> When we use the summary method of base R (not the method of SparkR) in the SparkR console:
> {code}
> model <- glm(Sepal.Width ~ Sepal.Length + Species, data = iris)
> summary(model)
> {code}
> It returns
> {code}
> Error in (function (classes, fdef, mtable)  : 
>   unable to find an inherited method for function ‘summary’ for signature 
> ‘"glm”’
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11587) SparkR can not use summary.glm from base R

2015-11-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11587:


Assignee: (was: Apache Spark)

> SparkR can not use summary.glm from base R
> --
>
> Key: SPARK-11587
> URL: https://issues.apache.org/jira/browse/SPARK-11587
> Project: Spark
>  Issue Type: Bug
>  Components: ML, R, SparkR
>Reporter: Yanbo Liang
>Priority: Critical
>
> When we use the summary method of base R (not the method of SparkR) in the SparkR console:
> {code}
> model <- glm(Sepal.Width ~ Sepal.Length + Species, data = iris)
> summary(model)
> {code}
> It returns
> {code}
> Error in (function (classes, fdef, mtable)  : 
>   unable to find an inherited method for function ‘summary’ for signature 
> ‘"glm”’
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11587) SparkR can not use summary.glm from base R

2015-11-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11587:


Assignee: Apache Spark

> SparkR can not use summary.glm from base R
> --
>
> Key: SPARK-11587
> URL: https://issues.apache.org/jira/browse/SPARK-11587
> Project: Spark
>  Issue Type: Bug
>  Components: ML, R, SparkR
>Reporter: Yanbo Liang
>Assignee: Apache Spark
>Priority: Critical
>
> When we use the summary method of base R (not the method of SparkR) in the SparkR console:
> {code}
> model <- glm(Sepal.Width ~ Sepal.Length + Species, data = iris)
> summary(model)
> {code}
> It returns
> {code}
> Error in (function (classes, fdef, mtable)  : 
>   unable to find an inherited method for function ‘summary’ for signature 
> ‘"glm”’
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11611) Python API for bisecting k-means

2015-11-09 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997875#comment-14997875
 ] 

Yu Ishikawa commented on SPARK-11611:
-

[~mengxr] can we change the target version from 1.7.0 to 1.6.0?

> Python API for bisecting k-means
> 
>
> Key: SPARK-11611
> URL: https://issues.apache.org/jira/browse/SPARK-11611
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib, PySpark
>Reporter: Xiangrui Meng
>
> Implement Python API for bisecting k-means.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-8506) SparkR does not provide an easy way to depend on Spark Packages when performing init from inside of R

2015-11-09 Thread Girish Reddy (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Girish Reddy updated SPARK-8506:

Comment: was deleted

(was: Hi [~holdenk] - I am getting an error when specifying multiple packages 
with a comma separating them.  Is there an example showing how multiple 
packages can be specified in the argument?)

> SparkR does not provide an easy way to depend on Spark Packages when 
> performing init from inside of R
> -
>
> Key: SPARK-8506
> URL: https://issues.apache.org/jira/browse/SPARK-8506
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.4.0
>Reporter: holdenk
>Assignee: holdenk
>Priority: Minor
> Fix For: 1.4.1, 1.5.0
>
>
> While packages can be specified when using the sparkR or sparkSubmit scripts, 
> the programming guide tells people to create their spark context using the R 
> shell + init. The init does have a parameter for jars but no parameter for 
> packages. Setting the SPARKR_SUBMIT_ARGS overwrites some necessary 
> information. I think a good solution would just be adding another field to 
> the init function to allow people to specify packages in the same way as jars.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-11620) parquet.hadoop.ParquetOutputCommitter.commitJob() throws parquet.io.ParquetEncodingException

2015-11-09 Thread swetha k (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998015#comment-14998015
 ] 

swetha k edited comment on SPARK-11620 at 11/10/15 6:07 AM:


I see the following warning message when I use parquet-avro in my Spark batch 
job. The following is the dependency that I use.

<dependency>
  <groupId>com.twitter</groupId>
  <artifactId>parquet-avro</artifactId>
  <version>1.6.0</version>
</dependency>

Nov 8, 2015 11:35:39 PM WARNING: parquet.hadoop.ParquetOutputCommitter: could 
not write summary file for active_sessions_current 
parquet.io.ParquetEncodingException: 
maprfs:/user/testId/active_sessions_current/part-r-00142.parquet invalid: all 
the files must be contained in the root active_sessions_current 
at parquet.hadoop.ParquetFileWriter.mergeFooters(ParquetFileWriter.java:422) 
at 
parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:398) 
at 
parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:51) 
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1056)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:998)


was (Author: swethakasireddy):
I see the following Warning message when I use parquet-avro. Following is the 
dependency that I use.

<dependency>
  <groupId>com.twitter</groupId>
  <artifactId>parquet-avro</artifactId>
  <version>1.6.0</version>
</dependency>

Nov 8, 2015 11:35:39 PM WARNING: parquet.hadoop.ParquetOutputCommitter: could 
not write summary file for active_sessions_current 
parquet.io.ParquetEncodingException: 
maprfs:/user/testId/active_sessions_current/part-r-00142.parquet invalid: all 
the files must be contained in the root active_sessions_current 
at parquet.hadoop.ParquetFileWriter.mergeFooters(ParquetFileWriter.java:422) 
at 
parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:398) 
at 
parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:51) 
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1056)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:998)

> parquet.hadoop.ParquetOutputCommitter.commitJob() throws 
> parquet.io.ParquetEncodingException
> 
>
> Key: SPARK-11620
> URL: https://issues.apache.org/jira/browse/SPARK-11620
> Project: Spark
>  Issue Type: Bug
>Reporter: swetha k
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11620) parquet.hadoop.ParquetOutputCommitter.commitJob() throws parquet.io.ParquetEncodingException

2015-11-09 Thread swetha k (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998015#comment-14998015
 ] 

swetha k commented on SPARK-11620:
--

I see the following Warning message when I use parquet-avro. Following is the 
dependency that I use.

<dependency>
  <groupId>com.twitter</groupId>
  <artifactId>parquet-avro</artifactId>
  <version>1.6.0</version>
</dependency>

Nov 8, 2015 11:35:39 PM WARNING: parquet.hadoop.ParquetOutputCommitter: could 
not write summary file for active_sessions_current 
parquet.io.ParquetEncodingException: 
maprfs:/user/testId/active_sessions_current/part-r-00142.parquet invalid: all 
the files must be contained in the root active_sessions_current 
at parquet.hadoop.ParquetFileWriter.mergeFooters(ParquetFileWriter.java:422) 
at 
parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:398) 
at 
parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:51) 
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1056)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:998)
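
A commonly used way to silence this particular summary-file warning (an 
assumption on my part, not something stated in this report) is to disable 
Parquet's summary metadata before writing, e.g.:

{code}
// Sketch: skip the _metadata/_common_metadata summary files so that
// ParquetOutputCommitter.commitJob() never attempts the merge that fails here.
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
{code}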

> parquet.hadoop.ParquetOutputCommitter.commitJob() throws 
> parquet.io.ParquetEncodingException
> 
>
> Key: SPARK-11620
> URL: https://issues.apache.org/jira/browse/SPARK-11620
> Project: Spark
>  Issue Type: Bug
>Reporter: swetha k
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11618) Refactoring of basic ML import/export

2015-11-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998016#comment-14998016
 ] 

Apache Spark commented on SPARK-11618:
--

User 'jkbradley' has created a pull request for this issue:
https://github.com/apache/spark/pull/9587

> Refactoring of basic ML import/export
> -
>
> Key: SPARK-11618
> URL: https://issues.apache.org/jira/browse/SPARK-11618
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>
> This is for a few updates to the original PR for basic ML import/export in 
> [SPARK-11217].
> * The original PR diverges from the design doc in that it does not include 
> the Spark version or a model format version.  We should include the Spark 
> version in the metadata.  If we do that, then we don't really need a model 
> format version.
> * Proposal: DefaultParamsWriter includes two separable pieces of logic in 
> save(): (a) handling overwriting and (b) saving Params.  I want to separate 
> these by putting (a) in a save() method in Writer which calls an abstract 
> saveImpl, and (b) in the saveImpl implementation in DefaultParamsWriter.  
> This is described below:
> {code}
> abstract class Writer {
>   def save(path: String) = {
> // handle overwrite
> saveImpl(path)
>   }
>   def saveImpl(path: String)   // abstract
> }
> class DefaultParamsWriter extends Writer {
>   def saveImpl(path: String) = {
> // save Params
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11141) Batching of ReceivedBlockTrackerLogEvents for efficient WAL writes

2015-11-09 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-11141.
---
   Resolution: Fixed
 Assignee: Burak Yavuz
Fix Version/s: 1.6.0

> Batching of ReceivedBlockTrackerLogEvents for efficient WAL writes
> --
>
> Key: SPARK-11141
> URL: https://issues.apache.org/jira/browse/SPARK-11141
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
> Fix For: 1.6.0
>
>
> When using S3 as a directory for WALs, the writes take too long. The driver 
> gets very easily bottlenecked when multiple receivers send AddBlock events to 
> the ReceiverTracker. This PR adds batching of events in the 
> ReceivedBlockTracker so that receivers don't get blocked by the driver for 
> too long.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11580) Just do final aggregation when there is no Exchange

2015-11-09 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-11580:
-
Target Version/s:   (was: 1.6.0)

> Just do final aggregation when there is no Exchange
> ---
>
> Key: SPARK-11580
> URL: https://issues.apache.org/jira/browse/SPARK-11580
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Yadong Qi
>
> I run the SQL below:
> {code}
> cache table src as select * from src distribute by key;
> select key, count(value) from src group by key;
> {code}
> and the Physical Plan is 
> {code}
> TungstenAggregate(key=[key#0], 
> functions=[(count(value#1),mode=Final,isDistinct=false)], 
> output=[key#0,_c1#28L])
>  TungstenAggregate(key=[key#0], 
> functions=[(count(value#1),mode=Partial,isDistinct=false)], 
> output=[key#0,currentCount#41L])
>   InMemoryColumnarTableScan [key#0,value#1], (InMemoryRelation 
> [key#0,value#1], true, 1, StorageLevel(true, true, false, true, 1), 
> (TungstenExchange hashpartitioning(key#0)), Some(src))
> {code}
> I think that if there is no *Exchange*, it is better to just do the final aggregation, like:
> {code}
> TungstenAggregate(key=[key#0], 
> functions=[(count(value#1),mode=Final,isDistinct=false)], 
> output=[key#0,_c1#28L])
>   InMemoryColumnarTableScan [key#0,value#1], (InMemoryRelation 
> [key#0,value#1], true, 1, StorageLevel(true, true, false, true, 1), 
> (TungstenExchange hashpartitioning(key#0)), Some(src))
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8506) SparkR does not provide an easy way to depend on Spark Packages when performing init from inside of R

2015-11-09 Thread Girish Reddy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998000#comment-14998000
 ] 

Girish Reddy commented on SPARK-8506:
-

Hi [~holdenk] - I am getting an error when specifying multiple packages with a 
comma separating them.  Is there an example showing how multiple packages can 
be specified in the argument?

> SparkR does not provide an easy way to depend on Spark Packages when 
> performing init from inside of R
> -
>
> Key: SPARK-8506
> URL: https://issues.apache.org/jira/browse/SPARK-8506
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.4.0
>Reporter: holdenk
>Assignee: holdenk
>Priority: Minor
> Fix For: 1.4.1, 1.5.0
>
>
> While packages can be specified when using the sparkR or sparkSubmit scripts, 
> the programming guide tells people to create their spark context using the R 
> shell + init. The init does have a parameter for jars but no parameter for 
> packages. Setting the SPARKR_SUBMIT_ARGS overwrites some necessary 
> information. I think a good solution would just be adding another field to 
> the init function to allow people to specify packages in the same way as jars.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7675) PySpark spark.ml Params type conversions

2015-11-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7675:
---

Assignee: Apache Spark

> PySpark spark.ml Params type conversions
> 
>
> Key: SPARK-7675
> URL: https://issues.apache.org/jira/browse/SPARK-7675
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>Priority: Minor
>
> Currently, PySpark wrappers for spark.ml Scala classes are brittle when 
> accepting Param types.  E.g., Normalizer's "p" param cannot be set to "2" (an 
> integer); it must be set to "2.0" (a float).  Fixing this is not trivial 
> since there does not appear to be a natural place to insert the conversion 
> before Python wrappers call Java's Params setter method.
> A possible fix would be to add a method "_checkType" to PySpark's Param 
> class which checks the type, prints an error if needed, and converts types 
> when relevant (e.g., int to float, or scipy matrix to array).  The Java 
> wrapper method which copies params to Scala can call this method when 
> available.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7675) PySpark spark.ml Params type conversions

2015-11-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997814#comment-14997814
 ] 

Apache Spark commented on SPARK-7675:
-

User 'holdenk' has created a pull request for this issue:
https://github.com/apache/spark/pull/9581

> PySpark spark.ml Params type conversions
> 
>
> Key: SPARK-7675
> URL: https://issues.apache.org/jira/browse/SPARK-7675
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Currently, PySpark wrappers for spark.ml Scala classes are brittle when 
> accepting Param types.  E.g., Normalizer's "p" param cannot be set to "2" (an 
> integer); it must be set to "2.0" (a float).  Fixing this is not trivial 
> since there does not appear to be a natural place to insert the conversion 
> before Python wrappers call Java's Params setter method.
> A possible fix would be to add a method "_checkType" to PySpark's Param 
> class which checks the type, prints an error if needed, and converts types 
> when relevant (e.g., int to float, or scipy matrix to array).  The Java 
> wrapper method which copies params to Scala can call this method when 
> available.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7675) PySpark spark.ml Params type conversions

2015-11-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7675:
---

Assignee: (was: Apache Spark)

> PySpark spark.ml Params type conversions
> 
>
> Key: SPARK-7675
> URL: https://issues.apache.org/jira/browse/SPARK-7675
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Currently, PySpark wrappers for spark.ml Scala classes are brittle when 
> accepting Param types.  E.g., Normalizer's "p" param cannot be set to "2" (an 
> integer); it must be set to "2.0" (a float).  Fixing this is not trivial 
> since there does not appear to be a natural place to insert the conversion 
> before Python wrappers call Java's Params setter method.
> A possible fix will be to include a method "_checkType" to PySpark's Param 
> class which checks the type, prints an error if needed, and converts types 
> when relevant (e.g., int to float, or scipy matrix to array).  The Java 
> wrapper method which copies params to Scala can call this method when 
> available.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11618) Refactoring of basic ML import/export

2015-11-09 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-11618:
-

 Summary: Refactoring of basic ML import/export
 Key: SPARK-11618
 URL: https://issues.apache.org/jira/browse/SPARK-11618
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley


This is for a few updates to the original PR for basic ML import/export in 
[SPARK-11217].
* The original PR diverges from the design doc in that it does not include the 
Spark version or a model format version.  We should include the Spark version 
in the metadata.  If we do that, then we don't really need a model format 
version.
* Proposal: DefaultParamsWriter includes two separable pieces of logic in 
save(): (a) handling overwriting and (b) saving Params.  I want to separate 
these by putting (a) in a save() method in Writer which calls an abstract 
saveImpl, and (b) in the saveImpl implementation in DefaultParamsWriter.  This 
is described below:

{code}
abstract class Writer {
  def save(path: String) = {
// handle overwrite
saveImpl(path)
  }
  def saveImpl(path: String)   // abstract
}

class DefaultParamsWriter extends Writer {
  def saveImpl(path: String) = {
// save Params
  }
}
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11621) ORC filter pushdown not working properly after new unhandled filter interface.

2015-11-09 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998014#comment-14998014
 ] 

Hyukjin Kwon commented on SPARK-11621:
--

I would like to work on this.

> ORC filter pushdown not working properly after new unhandled filter interface.
> --
>
> Key: SPARK-11621
> URL: https://issues.apache.org/jira/browse/SPARK-11621
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Hyukjin Kwon
>
> After the new interface was added to get rid of Spark-side evaluation of 
> filters that are already processed at the datasource level 
> (https://github.com/apache/spark/pull/9399), filters are no longer pushed down 
> for ORC.
> This is because in {{DataSourceStrategy}} the scan is classified as a 
> non-partitioned HadoopFsRelation scan, and all the filters are treated as 
> unhandled filters.
> Also, since ORC cannot filter fully record by record and only produces coarse 
> results, the filters for ORC should not go to the unhandled 
> filters.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11621) ORC filter pushdown not working properly after new unhandled filter interface.

2015-11-09 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-11621:


 Summary: ORC filter pushdown not working properly after new 
unhandled filter interface.
 Key: SPARK-11621
 URL: https://issues.apache.org/jira/browse/SPARK-11621
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.0
Reporter: Hyukjin Kwon


After the new interface was added to get rid of Spark-side evaluation of filters 
that are already processed at the datasource level 
(https://github.com/apache/spark/pull/9399), filters are no longer pushed down for 
ORC.

This is because in {{DataSourceStrategy}} the scan is classified as a 
non-partitioned HadoopFsRelation scan, and all the filters are treated as unhandled 
filters.

Also, since ORC cannot filter fully record by record and only produces coarse 
results, the filters for ORC should not go to the unhandled filters.
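
For reference, a minimal sketch of the {{unhandledFilters}} hook introduced by that PR (illustrative only, not the ORC relation's actual code); the default implementation marks every filter as unhandled, which is the behavior the ORC path currently falls into:

{code}
import org.apache.spark.sql.sources.{BaseRelation, Filter}

// Sketch: a relation reports which pushed-down filters it cannot fully evaluate,
// so Spark knows it must still re-check them after the scan. Returning the whole
// array (the default) means "everything is unhandled".
abstract class ExampleRelation extends BaseRelation {
  override def unhandledFilters(filters: Array[Filter]): Array[Filter] = {
    filters
  }
}
{code}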





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11587) SparkR can not use summary.glm from base R

2015-11-09 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-11587.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Resolved by https://github.com/apache/spark/pull/9582

> SparkR can not use summary.glm from base R
> --
>
> Key: SPARK-11587
> URL: https://issues.apache.org/jira/browse/SPARK-11587
> Project: Spark
>  Issue Type: Bug
>  Components: ML, R, SparkR
>Reporter: Yanbo Liang
>Assignee: Shivaram Venkataraman
>Priority: Critical
> Fix For: 1.6.0
>
>
> When we use summary method of base R(not method of SparkR) in SparkR console:
> {code}
> model <- glm(Sepal.Width ~ Sepal.Length + Species, data = iris)
> summary(model)
> {code}
> It returns
> {code}
> Error in (function (classes, fdef, mtable)  : 
>   unable to find an inherited method for function ‘summary’ for signature 
> ‘"glm”’
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11587) SparkR can not use summary.glm from base R

2015-11-09 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman reassigned SPARK-11587:
-

Assignee: Shivaram Venkataraman

> SparkR can not use summary.glm from base R
> --
>
> Key: SPARK-11587
> URL: https://issues.apache.org/jira/browse/SPARK-11587
> Project: Spark
>  Issue Type: Bug
>  Components: ML, R, SparkR
>Reporter: Yanbo Liang
>Assignee: Shivaram Venkataraman
>Priority: Critical
> Fix For: 1.6.0
>
>
> When we use summary method of base R(not method of SparkR) in SparkR console:
> {code}
> model <- glm(Sepal.Width ~ Sepal.Length + Species, data = iris)
> summary(model)
> {code}
> It returns
> {code}
> Error in (function (classes, fdef, mtable)  : 
>   unable to find an inherited method for function ‘summary’ for signature 
> ‘"glm”’
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11587) SparkR can not use summary.glm from base R

2015-11-09 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997967#comment-14997967
 ] 

Shivaram Venkataraman commented on SPARK-11587:
---

[~yanboliang] Let me know if the fix works as expected.

> SparkR can not use summary.glm from base R
> --
>
> Key: SPARK-11587
> URL: https://issues.apache.org/jira/browse/SPARK-11587
> Project: Spark
>  Issue Type: Bug
>  Components: ML, R, SparkR
>Reporter: Yanbo Liang
>Assignee: Shivaram Venkataraman
>Priority: Critical
> Fix For: 1.6.0
>
>
> When we use summary method of base R(not method of SparkR) in SparkR console:
> {code}
> model <- glm(Sepal.Width ~ Sepal.Length + Species, data = iris)
> summary(model)
> {code}
> It returns
> {code}
> Error in (function (classes, fdef, mtable)  : 
>   unable to find an inherited method for function ‘summary’ for signature 
> ‘"glm”’
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11617) MEMORY LEAK: ByteBuf.release() was not called before it's garbage-collected

2015-11-09 Thread LingZhou (JIRA)
LingZhou created SPARK-11617:


 Summary: MEMORY LEAK: ByteBuf.release() was not called before it's 
garbage-collected
 Key: SPARK-11617
 URL: https://issues.apache.org/jira/browse/SPARK-11617
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, YARN
Affects Versions: 1.6.0
Reporter: LingZhou


The problem may be related to
 [SPARK-11235][NETWORK] Add ability to stream data using network lib.

While running in yarn-client mode, the following error messages appear:

15/11/09 10:23:55 ERROR util.ResourceLeakDetector: LEAK: ByteBuf.release() was 
not called before it's garbage-collected. Enable advanced leak reporting to 
find out where the leak occurred. To enable advanced leak reporting, specify 
the JVM option '-Dio.netty.leakDetectionLevel=advanced' or call 
ResourceLeakDetector.setLevel() See 
http://netty.io/wiki/reference-counted-objects.html for more information.

and then it will cause 
cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Container killed by YARN 
for exceeding memory limits. 9.0 GB of 9 GB physical memory used. Consider 
boosting spark.yarn.executor.memoryOverhead.

and WARN scheduler.TaskSetManager: Lost task 105.0 in stage 1.0 (TID 2616, 
gsr489): java.lang.IndexOutOfBoundsException: index: 130828, length: 16833 
(expected: range(0, 524288)).
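
One way to get the advanced leak report the error message asks for is to pass the netty option through the existing extraJavaOptions settings (a sketch; adapt to your submit mechanism):

{code}
import org.apache.spark.SparkConf

// Sketch: forward the netty leak-detection option to both driver and executors.
val conf = new SparkConf()
  .set("spark.driver.extraJavaOptions", "-Dio.netty.leakDetectionLevel=advanced")
  .set("spark.executor.extraJavaOptions", "-Dio.netty.leakDetectionLevel=advanced")
{code}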



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11611) Python API for bisecting k-means

2015-11-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11611:


Assignee: Apache Spark

> Python API for bisecting k-means
> 
>
> Key: SPARK-11611
> URL: https://issues.apache.org/jira/browse/SPARK-11611
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib, PySpark
>Reporter: Xiangrui Meng
>Assignee: Apache Spark
>
> Implement Python API for bisecting k-means.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11611) Python API for bisecting k-means

2015-11-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997872#comment-14997872
 ] 

Apache Spark commented on SPARK-11611:
--

User 'yu-iskw' has created a pull request for this issue:
https://github.com/apache/spark/pull/9583

> Python API for bisecting k-means
> 
>
> Key: SPARK-11611
> URL: https://issues.apache.org/jira/browse/SPARK-11611
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib, PySpark
>Reporter: Xiangrui Meng
>
> Implement Python API for bisecting k-means.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11619) cannot use UDTF in DataFrame.selectExpr

2015-11-09 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-11619:
---

 Summary: cannot use UDTF in DataFrame.selectExpr
 Key: SPARK-11619
 URL: https://issues.apache.org/jira/browse/SPARK-11619
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Wenchen Fan
Priority: Minor


Currently, if we use a UDTF like `explode` or `json_tuple` in 
`DataFrame.selectExpr`, it is parsed into an `UnresolvedFunction` first and then 
aliased with `expr.prettyString`. However, a UDTF may need a MultiAlias, so we get 
an error if we run:
{code}
val df = Seq((Map("1" -> 1), 1)).toDF("a", "b")
df.selectExpr("explode(a)").show()
{code}
{code}
[info]   org.apache.spark.sql.AnalysisException: Expect multiple names given for org.apache.spark.sql.catalyst.expressions.Explode,
[info] but only single name ''explode(a)' specified;
{code}
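
As a workaround sketch (not the fix itself), the typed DataFrame API resolves the generator through Column analysis rather than the prettyString aliasing path, so the multiple output columns come out aliased; this assumes {{import sqlContext.implicits._}} is in scope, as in the shell:

{code}
import org.apache.spark.sql.functions.explode

val df = Seq((Map("1" -> 1), 1)).toDF("a", "b")
df.select(explode(df("a"))).show()   // produces key/value columns instead of failing
{code}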



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11191) [1.5] Can't create UDF's using hive thrift service

2015-11-09 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-11191:
---
Target Version/s: 1.5.2, 1.5.3  (was: 1.5.2, 1.5.3, 1.6.0)

> [1.5] Can't create UDF's using hive thrift service
> --
>
> Key: SPARK-11191
> URL: https://issues.apache.org/jira/browse/SPARK-11191
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
>Reporter: David Ross
>Priority: Blocker
>
> Since upgrading to spark 1.5 we've been unable to create and use UDF's when 
> we run in thrift server mode.
> Our setup:
> We start the thrift-server running against yarn in client mode, (we've also 
> built our own spark from github branch-1.5 with the following args: {{-Pyarn 
> -Phive -Phive-thrifeserver}}
> If i run the following after connecting via JDBC (in this case via beeline):
> {{add jar 'hdfs://path/to/jar"}}
> (this command succeeds with no errors)
> {{CREATE TEMPORARY FUNCTION testUDF AS 'com.foo.class.UDF';}}
> (this command succeeds with no errors)
> {{select testUDF(col1) from table1;}}
> I get the following error in the logs:
> {code}
> org.apache.spark.sql.AnalysisException: undefined function testUDF; line 1 
> pos 8
> at 
> org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2$$anonfun$1.apply(hiveUDFs.scala:58)
> at 
> org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2$$anonfun$1.apply(hiveUDFs.scala:58)
> at scala.Option.getOrElse(Option.scala:120)
> at 
> org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2.apply(hiveUDFs.scala:57)
> at 
> org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2.apply(hiveUDFs.scala:53)
> at scala.util.Try.getOrElse(Try.scala:77)
> at 
> org.apache.spark.sql.hive.HiveFunctionRegistry.lookupFunction(hiveUDFs.scala:53)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5$$anonfun$applyOrElse$24.apply(Analyzer.scala:506)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5$$anonfun$applyOrElse$24.apply(Analyzer.scala:506)
> at 
> org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:505)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:502)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:226)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249)
> {code}
> (cutting the bulk for ease of report, more than happy to send the full output)
> {code}
> 15/10/12 14:34:37 ERROR SparkExecuteStatementOperation: Error running hive 
> query:
> org.apache.hive.service.cli.HiveSQLException: 
> org.apache.spark.sql.AnalysisException: undefined function testUDF; line 1 
> pos 100
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.runInternal(SparkExecuteStatementOperation.scala:259)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:171)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:182)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> When I ran the same against 1.4 it worked.
> I've 

[jira] [Updated] (SPARK-6728) Improve performance of py4j for large bytearray

2015-11-09 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-6728:
--
Target Version/s:   (was: 1.6.0)

> Improve performance of py4j for large bytearray
> ---
>
> Key: SPARK-6728
> URL: https://issues.apache.org/jira/browse/SPARK-6728
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.3.0
>Reporter: Davies Liu
>Priority: Critical
>
> PySpark relies on py4j to transfer function arguments and return values between 
> Python and the JVM, and it is very slow to pass a large bytearray (larger than 
> 10M). In MLlib, it's possible to have a Vector with more than 100M bytes, which 
> will need a few GB of memory and may crash.
> The reason is that py4j uses a text protocol: it encodes the bytearray as 
> base64 and does multiple string concatenations. 
> A binary protocol would help a lot; an issue was created for py4j: 
> https://github.com/bartdag/py4j/issues/159



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10538) java.lang.NegativeArraySizeException during join

2015-11-09 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997051#comment-14997051
 ] 

Davies Liu commented on SPARK-10538:


[~maver1ck] Could you reproduce this issue in master or 1.6 branch ?

> java.lang.NegativeArraySizeException during join
> 
>
> Key: SPARK-10538
> URL: https://issues.apache.org/jira/browse/SPARK-10538
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Maciej Bryński
>Assignee: Davies Liu
> Attachments: screenshot-1.png
>
>
> Hi,
> I've got a problem during joining tables in PySpark. (in my example 20 of 
> them)
> I can observe that during calculation of first partition (on one of 
> consecutive joins) there is a big shuffle read size (294.7 MB / 146 records) 
> vs on others partitions (approx. 272.5 KB / 113 record)
> I can also observe that just before the crash python process going up to few 
> gb of RAM.
> After some time there is an exception:
> {code}
> java.lang.NegativeArraySizeException
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.TungstenProject$$anonfun$3$$anonfun$apply$3.apply(basicOperators.scala:90)
>   at 
> org.apache.spark.sql.execution.TungstenProject$$anonfun$3$$anonfun$apply$3.apply(basicOperators.scala:88)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.insertAll(BypassMergeSortShuffleWriter.java:119)
>   at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> I'm running this on 2 nodes cluster (12 cores, 64 GB RAM)
> Config:
> {code}
> spark.driver.memory  10g
> spark.executor.extraJavaOptions -XX:-UseGCOverheadLimit -XX:+UseParallelGC 
> -Dfile.encoding=UTF8
> spark.executor.memory   60g
> spark.storage.memoryFraction0.05
> spark.shuffle.memoryFraction0.75
> spark.driver.maxResultSize  10g  
> spark.cores.max 24
> spark.kryoserializer.buffer.max 1g
> spark.default.parallelism   200
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9865) Flaky SparkR test: test_sparkSQL.R: sample on a DataFrame

2015-11-09 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-9865:
-
Assignee: Felix Cheung

> Flaky SparkR test: test_sparkSQL.R: sample on a DataFrame
> -
>
> Key: SPARK-9865
> URL: https://issues.apache.org/jira/browse/SPARK-9865
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Davies Liu
>Assignee: Felix Cheung
> Fix For: 1.6.0
>
>
> 1. Failure (at test_sparkSQL.R#525): sample on a DataFrame 
> -
> count(sampled3) < 3 isn't true
> Error: Test failures
> Execution halted
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1468/console



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9865) Flaky SparkR test: test_sparkSQL.R: sample on a DataFrame

2015-11-09 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-9865.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9549
[https://github.com/apache/spark/pull/9549]

> Flaky SparkR test: test_sparkSQL.R: sample on a DataFrame
> -
>
> Key: SPARK-9865
> URL: https://issues.apache.org/jira/browse/SPARK-9865
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Davies Liu
> Fix For: 1.6.0
>
>
> 1. Failure (at test_sparkSQL.R#525): sample on a DataFrame 
> -
> count(sampled3) < 3 isn't true
> Error: Test failures
> Execution halted
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1468/console



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11319) PySpark silently Accepts null values in non-nullable DataFrame fields.

2015-11-09 Thread Daniel Jalova (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997100#comment-14997100
 ] 

Daniel Jalova commented on SPARK-11319:
---

I would like to work on this.

> PySpark silently Accepts null values in non-nullable DataFrame fields.
> --
>
> Key: SPARK-11319
> URL: https://issues.apache.org/jira/browse/SPARK-11319
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: Kevin Cox
>
> Running the following code with a null value in a non-nullable column 
> silently works. This makes the code incredibly hard to trust.
> {code}
> In [2]: from pyspark.sql.types import *
> In [3]: sqlContext.createDataFrame([(None,)], StructType([StructField("a", 
> TimestampType(), False)])).collect()
> Out[3]: [Row(a=None)]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11597) improve performance of array and map encoder

2015-11-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997127#comment-14997127
 ] 

Apache Spark commented on SPARK-11597:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/9572

> improve performance of array and map encoder
> 
>
> Key: SPARK-11597
> URL: https://issues.apache.org/jira/browse/SPARK-11597
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11597) improve performance of array and map encoder

2015-11-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11597:


Assignee: (was: Apache Spark)

> improve performance of array and map encoder
> 
>
> Key: SPARK-11597
> URL: https://issues.apache.org/jira/browse/SPARK-11597
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11597) improve performance of array and map encoder

2015-11-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11597:


Assignee: Apache Spark

> improve performance of array and map encoder
> 
>
> Key: SPARK-11597
> URL: https://issues.apache.org/jira/browse/SPARK-11597
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11598) Add tests for ShuffledHashOuterJoin

2015-11-09 Thread Davies Liu (JIRA)
Davies Liu created SPARK-11598:
--

 Summary: Add tests for ShuffledHashOuterJoin
 Key: SPARK-11598
 URL: https://issues.apache.org/jira/browse/SPARK-11598
 Project: Spark
  Issue Type: Test
Reporter: Davies Liu
Assignee: Davies Liu


We only test the default algorithm (SortMergeOuterJoin) for outer join, 
ShuffledHashOuterJoin is not well tested.
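
A sketch of the kind of test this could add, assuming the 1.5/1.6 planner flag {{spark.sql.planner.sortMergeJoin}} is the switch between the two strategies (verify the exact flag before relying on it): run the same outer join with sort-merge disabled and compare against the default plan's result.

{code}
import org.apache.spark.sql.SQLContext

def checkShuffledHashOuterJoin(sqlContext: SQLContext): Unit = {
  import sqlContext.implicits._
  val left  = Seq((1, "a"), (2, "b")).toDF("k", "v1")
  val right = Seq((2, "x"), (3, "y")).toDF("k", "v2")

  // Expected result under the default (sort-merge) planning.
  val expected = left.join(right, left("k") === right("k"), "left_outer").collect().toSet

  sqlContext.setConf("spark.sql.planner.sortMergeJoin", "false")  // force the hash-based outer join
  try {
    val actual = left.join(right, left("k") === right("k"), "left_outer").collect().toSet
    assert(actual == expected)
  } finally {
    sqlContext.setConf("spark.sql.planner.sortMergeJoin", "true")
  }
}
{code}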



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5575) Artificial neural networks for MLlib deep learning

2015-11-09 Thread Narine Kokhlikyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14996706#comment-14996706
 ] 

Narine Kokhlikyan commented on SPARK-5575:
--

Hi [~avulanov] ,

I was trying out the current implementation of ANN and have one question about 
it.

Usually, when I run a neural network with other tools such as R, I can 
additionally see information such as the error, the reached threshold, and the 
number of steps. Can I also somehow get such information from Spark ANN? Maybe it 
is already there and I just couldn't find it.

I looked through the implementations of GradientDescent and LBFGS, and it seems 
that optimizer.optimize doesn't return values for the error, number of 
iterations, etc.

I might be wrong here and am still investigating, but I'd be happy to hear from 
you regarding this.

Thanks,
Narine


> Artificial neural networks for MLlib deep learning
> --
>
> Key: SPARK-5575
> URL: https://issues.apache.org/jira/browse/SPARK-5575
> Project: Spark
>  Issue Type: Umbrella
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Alexander Ulanov
>
> Goal: Implement various types of artificial neural networks
> Motivation: deep learning trend
> Requirements: 
> 1) Basic abstractions such as Neuron, Layer, Error, Regularization, Forward 
> and Backpropagation etc. should be implemented as traits or interfaces, so 
> they can be easily extended or reused
> 2) Implement complex abstractions, such as feed forward and recurrent networks
> 3) Implement multilayer perceptron (MLP), convolutional networks (LeNet), 
> autoencoder (sparse and denoising), stacked autoencoder, restricted  
> boltzmann machines (RBM), deep belief networks (DBN) etc.
> 4) Implement or reuse supporting constucts, such as classifiers, normalizers, 
> poolers,  etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11191) [1.5] Can't create UDF's using hive thrift service

2015-11-09 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997028#comment-14997028
 ] 

Davies Liu commented on SPARK-11191:


This should work in master and 1.6.

> [1.5] Can't create UDF's using hive thrift service
> --
>
> Key: SPARK-11191
> URL: https://issues.apache.org/jira/browse/SPARK-11191
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
>Reporter: David Ross
>Priority: Blocker
>
> Since upgrading to spark 1.5 we've been unable to create and use UDF's when 
> we run in thrift server mode.
> Our setup:
> We start the thrift-server running against yarn in client mode, (we've also 
> built our own spark from github branch-1.5 with the following args: {{-Pyarn 
> -Phive -Phive-thrifeserver}}
> If i run the following after connecting via JDBC (in this case via beeline):
> {{add jar 'hdfs://path/to/jar"}}
> (this command succeeds with no errors)
> {{CREATE TEMPORARY FUNCTION testUDF AS 'com.foo.class.UDF';}}
> (this command succeeds with no errors)
> {{select testUDF(col1) from table1;}}
> I get the following error in the logs:
> {code}
> org.apache.spark.sql.AnalysisException: undefined function testUDF; line 1 
> pos 8
> at 
> org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2$$anonfun$1.apply(hiveUDFs.scala:58)
> at 
> org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2$$anonfun$1.apply(hiveUDFs.scala:58)
> at scala.Option.getOrElse(Option.scala:120)
> at 
> org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2.apply(hiveUDFs.scala:57)
> at 
> org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2.apply(hiveUDFs.scala:53)
> at scala.util.Try.getOrElse(Try.scala:77)
> at 
> org.apache.spark.sql.hive.HiveFunctionRegistry.lookupFunction(hiveUDFs.scala:53)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5$$anonfun$applyOrElse$24.apply(Analyzer.scala:506)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5$$anonfun$applyOrElse$24.apply(Analyzer.scala:506)
> at 
> org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:505)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:502)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:226)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249)
> {code}
> (cutting the bulk for ease of report, more than happy to send the full output)
> {code}
> 15/10/12 14:34:37 ERROR SparkExecuteStatementOperation: Error running hive 
> query:
> org.apache.hive.service.cli.HiveSQLException: 
> org.apache.spark.sql.AnalysisException: undefined function testUDF; line 1 
> pos 100
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.runInternal(SparkExecuteStatementOperation.scala:259)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:171)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:182)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> When I ran the same against 1.4 it 

[jira] [Commented] (SPARK-11373) Add metrics to the History Server and providers

2015-11-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997071#comment-14997071
 ] 

Apache Spark commented on SPARK-11373:
--

User 'steveloughran' has created a pull request for this issue:
https://github.com/apache/spark/pull/9571

> Add metrics to the History Server and providers
> ---
>
> Key: SPARK-11373
> URL: https://issues.apache.org/jira/browse/SPARK-11373
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 1.5.1
>Reporter: Steve Loughran
>
> The History server doesn't publish metrics about JVM load or anything from 
> the history provider plugins. This means that performance problems from 
> massive job histories aren't visible to management tools, and nor are any 
> provider-generated metrics such as time to load histories, failed history 
> loads, the number of connectivity failures talking to remote services, etc.
> If the history server set up a metrics registry and offered the option to 
> publish its metrics, then management tools could view this data.
> # the metrics registry would need to be passed down to the instantiated 
> {{ApplicationHistoryProvider}}, in order for it to register its metrics.
> # if the codahale metrics servlet were registered under a path such as 
> {{/metrics}}, the values would be visible as HTML and JSON, without the need 
> for management tools.
> # Integration tests could also retrieve the JSON-formatted data and use it as 
> part of the test suites.
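
A rough sketch of the provider-side metrics the description asks for, using the Codahale/Dropwizard API that Spark's metrics system is built on (the metric names and the wrapper class are assumptions):

{code}
import com.codahale.metrics.MetricRegistry

// Hypothetical wrapper a history provider could use once it is handed a registry.
class ProviderMetrics(registry: MetricRegistry) {
  private val loadTimer    = registry.timer("history.provider.load.time")
  private val loadFailures = registry.counter("history.provider.load.failures")

  def timedLoad[T](body: => T): T = {
    val ctx = loadTimer.time()
    try {
      body
    } catch {
      case e: Throwable =>
        loadFailures.inc()
        throw e
    } finally {
      ctx.stop()
    }
  }
}
{code}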



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11373) Add metrics to the History Server and providers

2015-11-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11373:


Assignee: Apache Spark

> Add metrics to the History Server and providers
> ---
>
> Key: SPARK-11373
> URL: https://issues.apache.org/jira/browse/SPARK-11373
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 1.5.1
>Reporter: Steve Loughran
>Assignee: Apache Spark
>
> The History server doesn't publish metrics about JVM load or anything from 
> the history provider plugins. This means that performance problems from 
> massive job histories aren't visible to management tools, and nor are any 
> provider-generated metrics such as time to load histories, failed history 
> loads, the number of connectivity failures talking to remote services, etc.
> If the history server set up a metrics registry and offered the option to 
> publish its metrics, then management tools could view this data.
> # the metrics registry would need to be passed down to the instantiated 
> {{ApplicationHistoryProvider}}, in order for it to register its metrics.
> # if the codahale metrics servlet were registered under a path such as 
> {{/metrics}}, the values would be visible as HTML and JSON, without the need 
> for management tools.
> # Integration tests could also retrieve the JSON-formatted data and use it as 
> part of the test suites.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10790) Dynamic Allocation does not request any executors if first stage needs less than or equal to spark.dynamicAllocation.initialExecutors

2015-11-09 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-10790:
--
Affects Version/s: (was: 1.5.1)
   1.5.0

> Dynamic Allocation does not request any executors if first stage needs less 
> than or equal to spark.dynamicAllocation.initialExecutors
> -
>
> Key: SPARK-10790
> URL: https://issues.apache.org/jira/browse/SPARK-10790
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.5.0
>Reporter: Jonathan Kelly
>Assignee: Saisai Shao
> Fix For: 1.5.2, 1.6.0
>
>
> If you set spark.dynamicAllocation.initialExecutors > 0 (or 
> spark.dynamicAllocation.minExecutors, since 
> spark.dynamicAllocation.initialExecutors defaults to 
> spark.dynamicAllocation.minExecutors), and the number of tasks in the first 
> stage of your job is less than or equal to this min/init number of executors, 
> dynamic allocation won't actually request any executors and will just hang 
> indefinitely with the warning "Initial job has not accepted any resources; 
> check your cluster UI to ensure that workers are registered and have 
> sufficient resources".
> The cause appears to be that ExecutorAllocationManager does not request any 
> executors while the application is still initializing, but it still sets the 
> initial value of numExecutorsTarget to 
> spark.dynamicAllocation.initialExecutors. Once the job is running and has 
> submitted its first task, if the first task does not need more than 
> spark.dynamicAllocation.initialExecutors, 
> ExecutorAllocationManager.updateAndSyncNumExecutorsTarget() does not think 
> that it needs to request any executors, so it doesn't.
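
For reference, a minimal configuration sketch that hits this case (values illustrative): a job whose first stage has at most two tasks, submitted with the settings below, hangs because no executors are ever requested.

{code}
import org.apache.spark.SparkConf

// Illustrative settings only; the shuffle service is a general prerequisite
// for dynamic allocation, not specific to this bug.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "2")
  .set("spark.dynamicAllocation.initialExecutors", "2")  // defaults to minExecutors anyway
{code}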



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10790) Dynamic Allocation does not request any executors if first stage needs less than or equal to spark.dynamicAllocation.initialExecutors

2015-11-09 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-10790:
--
Affects Version/s: (was: 1.5.0)
   1.5.1

> Dynamic Allocation does not request any executors if first stage needs less 
> than or equal to spark.dynamicAllocation.initialExecutors
> -
>
> Key: SPARK-10790
> URL: https://issues.apache.org/jira/browse/SPARK-10790
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.5.0
>Reporter: Jonathan Kelly
>Assignee: Saisai Shao
> Fix For: 1.5.2, 1.6.0
>
>
> If you set spark.dynamicAllocation.initialExecutors > 0 (or 
> spark.dynamicAllocation.minExecutors, since 
> spark.dynamicAllocation.initialExecutors defaults to 
> spark.dynamicAllocation.minExecutors), and the number of tasks in the first 
> stage of your job is less than or equal to this min/init number of executors, 
> dynamic allocation won't actually request any executors and will just hang 
> indefinitely with the warning "Initial job has not accepted any resources; 
> check your cluster UI to ensure that workers are registered and have 
> sufficient resources".
> The cause appears to be that ExecutorAllocationManager does not request any 
> executors while the application is still initializing, but it still sets the 
> initial value of numExecutorsTarget to 
> spark.dynamicAllocation.initialExecutors. Once the job is running and has 
> submitted its first task, if the first task does not need more than 
> spark.dynamicAllocation.initialExecutors, 
> ExecutorAllocationManager.updateAndSyncNumExecutorsTarget() does not think 
> that it needs to request any executors, so it doesn't.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11362) Use Spark BitSet in BroadcastNestedLoopJoin

2015-11-09 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-11362:

Fix Version/s: (was: 1.7.0)
   1.6.0

> Use Spark BitSet in BroadcastNestedLoopJoin
> ---
>
> Key: SPARK-11362
> URL: https://issues.apache.org/jira/browse/SPARK-11362
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 1.6.0
>
>
> We use scala.collection.mutable.BitSet in BroadcastNestedLoopJoin now. We 
> should use Spark's BitSet.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10565) New /api/v1/[path] APIs don't contain as much information as original /json API

2015-11-09 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid resolved SPARK-10565.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

> New /api/v1/[path] APIs don't contain as much information as original /json 
> API 
> 
>
> Key: SPARK-10565
> URL: https://issues.apache.org/jira/browse/SPARK-10565
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output, Java API
>Affects Versions: 1.5.0
>Reporter: Kevin Chen
>Assignee: Charles Yeh
> Fix For: 1.6.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> [SPARK-3454] introduced official json APIs at /api/v1/[path] for data that 
> originally appeared only on the web UI. However, it does not expose all the 
> information on the web UI or on the previous unofficial endpoint at /json.
> For example, the APIs at /api/v1/[path] do not show the number of cores or 
> amount of memory per slave for each job. This is stored in 
> ApplicationInfo.desc.maxCores and ApplicationInfo.desc.memoryPerSlave, 
> respectively. This information would be useful to expose.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10783) Do track the pointer array in UnsafeInMemorySorter

2015-11-09 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-10783.

   Resolution: Fixed
 Assignee: Davies Liu
Fix Version/s: 1.6.0

Fixed by https://github.com/apache/spark/pull/9241

> Do track the pointer array in UnsafeInMemorySorter
> --
>
> Key: SPARK-10783
> URL: https://issues.apache.org/jira/browse/SPARK-10783
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1, 1.6.0
>Reporter: Andrew Or
>Assignee: Davies Liu
>Priority: Blocker
> Fix For: 1.6.0
>
>
> SPARK-10474 (https://github.com/apache/spark/pull/) removed the pointer 
> array tracking because `TungstenAggregate` would fail under memory pressure. 
> However, this is somewhat of a hack that we should fix in the right way in 
> 1.6.0 to ensure we don't OOM because of this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11151) Use Long internally for DecimalType with precision <= 18

2015-11-09 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-11151:
---
Target Version/s:   (was: 1.6.0)

> Use Long internally for DecimalType with precision <= 18
> 
>
> Key: SPARK-11151
> URL: https://issues.apache.org/jira/browse/SPARK-11151
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>
> It's expensive to create a Decimal object for small values; we could use a Long 
> directly, just like what we did for Date and Timestamp.
> This will involve lots of changes, including:
> 1) inbound/outbound conversion
> 2) access/storage in InternalRow
> 3) all the expressions that support DecimalType
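
A worked sketch of the idea (illustration only): for precision <= 18 the unscaled value fits in a Long, so a value like 123.45 with DecimalType(10, 2) can live in InternalRow as the long 12345 and only be materialized as a BigDecimal at the boundaries.

{code}
val scale = 2
val unscaled: Long = 12345L                                       // stored in InternalRow
val materialized = java.math.BigDecimal.valueOf(unscaled, scale)  // 123.45, built on demand
assert(materialized.toPlainString == "123.45")
assert(math.abs(unscaled) < math.pow(10, 18).toLong)              // precision <= 18 fits in a Long
{code}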



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-6728) Improve performance of py4j for large bytearray

2015-11-09 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu closed SPARK-6728.
-
Resolution: Won't Fix

This can not be fixed without changes in Py4j, but this is not in the roadmap 
of Py4j yet.

> Improve performance of py4j for large bytearray
> ---
>
> Key: SPARK-6728
> URL: https://issues.apache.org/jira/browse/SPARK-6728
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.3.0
>Reporter: Davies Liu
>Priority: Critical
>
> PySpark relies on py4j to transfer function arguments and return values between 
> Python and the JVM, and it is very slow to pass a large bytearray (larger than 
> 10M). In MLlib, it's possible to have a Vector with more than 100M bytes, which 
> will need a few GB of memory and may crash.
> The reason is that py4j uses a text protocol: it encodes the bytearray as 
> base64 and does multiple string concatenations. 
> A binary protocol would help a lot; an issue was created for py4j: 
> https://github.com/bartdag/py4j/issues/159



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11089) Add a option for thrift-server to share a single session across all connections

2015-11-09 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997042#comment-14997042
 ] 

Davies Liu commented on SPARK-11089:


[~lian cheng] Could you help look into this one? This is the backward-compatible 
mode for 1.6; without it, UDFs may not work across connections (for applications 
that use short-lived connections, one connection per request). 

> Add a option for thrift-server to share a single session across all 
> connections
> ---
>
> Key: SPARK-11089
> URL: https://issues.apache.org/jira/browse/SPARK-11089
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Cheng Lian
>
> In 1.6, we improved session support in the JDBC server by separating temporary 
> tables and UDFs. In some cases, users may still want to share temporary 
> tables or UDFs across different applications.
> We should have an option or config to support that (use the original 
> SQLContext instead of calling newSession when it is set to true).
> cc [~marmbrus]
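
A minimal sketch of the proposed behavior (the config key is an assumption, not a final name): when the option is on, every JDBC connection reuses the original context instead of an isolated {{newSession()}}, so temporary tables and UDFs stay visible across connections.

{code}
import org.apache.spark.sql.hive.HiveContext

// Hypothetical helper; "spark.sql.hive.thriftServer.singleSession" is an assumed key.
def contextForConnection(hiveContext: HiveContext): HiveContext = {
  val single = hiveContext.getConf("spark.sql.hive.thriftServer.singleSession", "false").toBoolean
  if (single) hiveContext            // shared session: temp tables / UDFs survive across connections
  else hiveContext.newSession()      // 1.6 default: isolated session per connection
}
{code}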



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-10425) Add a regression test for SPARK-10379

2015-11-09 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu closed SPARK-10425.
--
Resolution: Won't Fix

Can't reproduce the failure (easily)

> Add a regression test for SPARK-10379
> -
>
> Key: SPARK-10425
> URL: https://issues.apache.org/jira/browse/SPARK-10425
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Davies Liu
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-10424) ShuffleHashOuterJoin should consider condition

2015-11-09 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu closed SPARK-10424.
--
Resolution: Invalid

ShuffledHashOuterJoin actually supports conditions.

> ShuffleHashOuterJoin should consider condition
> --
>
> Key: SPARK-10424
> URL: https://issues.apache.org/jira/browse/SPARK-10424
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Davies Liu
>Priority: Blocker
>
> Currently, ShuffleHashOuterJoin does not consider condition



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10565) New /api/v1/[path] APIs don't contain as much information as original /json API

2015-11-09 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-10565:
-
Assignee: Charles Yeh

> New /api/v1/[path] APIs don't contain as much information as original /json 
> API 
> 
>
> Key: SPARK-10565
> URL: https://issues.apache.org/jira/browse/SPARK-10565
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output, Java API
>Affects Versions: 1.5.0
>Reporter: Kevin Chen
>Assignee: Charles Yeh
> Fix For: 1.6.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> [SPARK-3454] introduced official json APIs at /api/v1/[path] for data that 
> originally appeared only on the web UI. However, it does not expose all the 
> information on the web UI or on the previous unofficial endpoint at /json.
> For example, the APIs at /api/v1/[path] do not show the number of cores or 
> amount of memory per slave for each job. This is stored in 
> ApplicationInfo.desc.maxCores and ApplicationInfo.desc.memoryPerSlave, 
> respectively. This information would be useful to expose.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11587) SparkR can not use summary.glm from base R

2015-11-09 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997035#comment-14997035
 ] 

Shivaram Venkataraman commented on SPARK-11587:
---

cc [~sunrui] We should fix this; we have a number of other DataFrame methods 
that override things correctly.

> SparkR can not use summary.glm from base R
> --
>
> Key: SPARK-11587
> URL: https://issues.apache.org/jira/browse/SPARK-11587
> Project: Spark
>  Issue Type: Bug
>  Components: ML, R, SparkR
>Reporter: Yanbo Liang
>
> When we use summary method of base R(not method of SparkR) in SparkR console:
> {code}
> model <- glm(Sepal.Width ~ Sepal.Length + Species, data = iris)
> summary(model)
> {code}
> It returns
> {code}
> Error in (function (classes, fdef, mtable)  : 
>   unable to find an inherited method for function ‘summary’ for signature 
> ‘"glm”’
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11587) SparkR can not use summary.glm from base R

2015-11-09 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-11587:
--
Component/s: SparkR

> SparkR can not use summary.glm from base R
> --
>
> Key: SPARK-11587
> URL: https://issues.apache.org/jira/browse/SPARK-11587
> Project: Spark
>  Issue Type: Bug
>  Components: ML, R, SparkR
>Reporter: Yanbo Liang
>
> When we use summary method of base R(not method of SparkR) in SparkR console:
> {code}
> model <- glm(Sepal.Width ~ Sepal.Length + Species, data = iris)
> summary(model)
> {code}
> It returns
> {code}
> Error in (function (classes, fdef, mtable)  : 
>   unable to find an inherited method for function ‘summary’ for signature 
> ‘"glm”’
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11089) Add a option for thrift-server to share a single session across all connections

2015-11-09 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reassigned SPARK-11089:
--

Assignee: Davies Liu

> Add a option for thrift-server to share a single session across all 
> connections
> ---
>
> Key: SPARK-11089
> URL: https://issues.apache.org/jira/browse/SPARK-11089
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> In 1.6, we improved session support in the JDBC server by separating temporary 
> tables and UDFs. In some cases, users may still want to share temporary 
> tables or UDFs across different applications.
> We should have an option or config to support that (use the original 
> SQLContext instead of calling newSession when it is set to true).
> cc [~marmbrus]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11560) Optimize KMeans implementation

2015-11-09 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997103#comment-14997103
 ] 

Joseph K. Bradley commented on SPARK-11560:
---

Do we want to keep the implementation for the Pipelines API?  We had worked on 
stacking models for linear methods (to do many runs at once) to amortize 
overhead, and this is the same kind of effort.  It should be helpful in some 
problem domains.  Has there been evidence that it's rarely useful?

> Optimize KMeans implementation
> --
>
> Key: SPARK-11560
> URL: https://issues.apache.org/jira/browse/SPARK-11560
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.7.0
>Reporter: Xiangrui Meng
>
> After we dropped `runs`, we can simplify and optimize the k-means 
> implementation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11597) improve performance of array and map encoder

2015-11-09 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-11597:
---

 Summary: improve performance of array and map encoder
 Key: SPARK-11597
 URL: https://issues.apache.org/jira/browse/SPARK-11597
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11475) DataFrame API saveAsTable() does not work well for HDFS HA

2015-11-09 Thread Rekha Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997119#comment-14997119
 ] 

Rekha Joshi commented on SPARK-11475:
-

Glad to hear it. Thanks for confirming, [~zhangxiongfei].

> DataFrame API saveAsTable() does not work well for HDFS HA
> --
>
> Key: SPARK-11475
> URL: https://issues.apache.org/jira/browse/SPARK-11475
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
> Environment: Hadoop 2.4 & Spark 1.5.1
>Reporter: zhangxiongfei
> Attachments: dataFrame_saveAsTable.txt, hdfs-site.xml, hive-site.xml
>
>
> I was trying to save a DataFrame to Hive using the following code:
> {quote}
> sqlContext.range(1L,1000L,2L,2).coalesce(1).saveAsTable("dataframeTable")
> {quote}
> But got below exception:
> {quote}
> warning: there were 1 deprecation warning(s); re-run with -deprecation for 
> details
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException):
>  Operation category READ is not supported in state standby
> at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:87)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1610)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1193)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:3516)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:785)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(
> {quote}
> *My Hive configuration is* :
> {quote}
>   <property>
>     <name>hive.metastore.warehouse.dir</name>
>     <value>/apps/hive/warehouse</value>
>   </property>
> {quote}
> It seems that HDFS HA is not configured, so I then tried the code below:
> {quote}
> sqlContext.range(1L,1000L,2L,2).coalesce(1).saveAsParquetFile("hdfs://bitautodmp/apps/hive/warehouse/dataframeTable")
> {quote}
> I verified that the *saveAsParquetFile* API worked well with the following 
> commands:
> {quote}
> *hadoop fs -ls /apps/hive/warehouse/dataframeTable*
> Found 4 items
> -rw-r--r--   3 zhangxf hdfs  0 2015-11-03 17:57 
> */apps/hive/warehouse/dataframeTable/_SUCCESS*
> -rw-r--r--   3 zhangxf hdfs199 2015-11-03 17:57 
> */apps/hive/warehouse/dataframeTable/_common_metadata*
> -rw-r--r--   3 zhangxf hdfs325 2015-11-03 17:57 
> */apps/hive/warehouse/dataframeTable/_metadata*
> -rw-r--r--   3 zhangxf hdfs   1098 2015-11-03 17:57 
> */apps/hive/warehouse/dataframeTable/part-r-0-a05a9bf3-b2a6-40e5-b180-818efb2a0f54.gz.parquet*
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11598) Add tests for ShuffledHashOuterJoin

2015-11-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11598:


Assignee: Davies Liu  (was: Apache Spark)

> Add tests for ShuffledHashOuterJoin
> ---
>
> Key: SPARK-11598
> URL: https://issues.apache.org/jira/browse/SPARK-11598
> Project: Spark
>  Issue Type: Test
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> We only test the default algorithm (SortMergeOuterJoin) for outer join, 
> ShuffledHashOuterJoin is not well tested.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11362) Use Spark BitSet in BroadcastNestedLoopJoin

2015-11-09 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-11362:

Assignee: Liang-Chi Hsieh

> Use Spark BitSet in BroadcastNestedLoopJoin
> ---
>
> Key: SPARK-11362
> URL: https://issues.apache.org/jira/browse/SPARK-11362
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 1.6.0
>
>
> We use scala.collection.mutable.BitSet in BroadcastNestedLoopJoin now. We 
> should use Spark's BitSet.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


