[jira] [Commented] (SPARK-11595) "ADD JAR" doesn't work if the given path contains URL scheme like "file:/" and "hdfs:/"
[ https://issues.apache.org/jira/browse/SPARK-11595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14996611#comment-14996611 ]
Apache Spark commented on SPARK-11595:
--
User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/9570
> "ADD JAR" doesn't work if the given path contains URL scheme like "file:/" and "hdfs:/"
> ---
>
> Key: SPARK-11595
> URL: https://issues.apache.org/jira/browse/SPARK-11595
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.5.2, 1.6.0
> Reporter: Cheng Lian
> Assignee: Cheng Lian
> Priority: Blocker
>
> When handling {{ADD JAR}}, Spark constructs a {{java.io.File}} first using the input jar path, and then converts it into a URL ([here|https://github.com/apache/spark/pull/8909/files#diff-d613c921507243c65591c003a348f5f3R180]).
> This works fine for a local file path without a URL scheme (e.g. {{/tmp/a.jar}}). However, {{java.io.File.toURI}} returns an unexpected result when given a path containing a URL scheme (e.g. {{file:///tmp/a.jar}} or {{hdfs://host:9000/path/to/a.jar}}):
> {noformat}
> scala> new java.io.File("file:///tmp/file").toURI
> res1: java.net.URI = file:/Users/lian/local/src/spark/workspace-a/file:/tmp/file
> {noformat}
> The consequence is that, although the {{ADD JAR}} command doesn't fail immediately, the jar is not actually added properly.
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7334) Implement RandomProjection for Dimensionality Reduction
[ https://issues.apache.org/jira/browse/SPARK-7334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14996470#comment-14996470 ]
Sebastian Alfers commented on SPARK-7334:
-
Is this still relevant? [~josephkb]
I saw a discussion about LSH here: https://issues.apache.org/jira/browse/SPARK-5992
> Implement RandomProjection for Dimensionality Reduction
> ---
>
> Key: SPARK-7334
> URL: https://issues.apache.org/jira/browse/SPARK-7334
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Reporter: Sebastian Alfers
> Priority: Minor
>
> Implement RandomProjection (RP) for dimensionality reduction
> RP is a popular approach to reduce the amount of data while preserving a reasonable amount of information (pairwise distance) of your data [1][2]
> - [1] http://www.yaroslavvb.com/papers/achlioptas-database.pdf
> - [2] http://people.inf.elte.hu/fekete/algoritmusok_msc/dimenzio_csokkentes/randon_projection_kdd.pdf
> I compared different implementations of that algorithm:
> - https://github.com/sebastian-alfers/random-projection-python
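For readers unfamiliar with the technique requested in SPARK-7334, the sparse projection described in reference [1] (Achlioptas) can be sketched in plain Scala. This is only an illustration under stated assumptions, not the MLlib design proposed in the issue: the object and method names are hypothetical, the data is held in local arrays rather than an RDD, and each matrix entry is drawn as sqrt(3) times +1, 0 or -1 with probabilities 1/6, 2/3 and 1/6.

```scala
import scala.util.Random

// Hypothetical helper, not part of Spark or MLlib.
object RandomProjectionSketch {

  /** Builds a d x k projection matrix whose entries are
    * sqrt(3) * (+1, 0, -1) with probabilities 1/6, 2/3, 1/6. */
  def projectionMatrix(d: Int, k: Int, seed: Long): Array[Array[Double]] = {
    val rng = new Random(seed)
    val s = math.sqrt(3.0)
    Array.fill(d, k) {
      val u = rng.nextInt(6) // uniform on 0..5
      if (u == 0) s else if (u == 1) -s else 0.0
    }
  }

  /** Projects one d-dimensional row down to k dimensions,
    * scaling by 1/sqrt(k) so squared distances are preserved in expectation. */
  def project(row: Array[Double], r: Array[Array[Double]]): Array[Double] = {
    require(row.length == r.length, "row length must match matrix height")
    val k = r.head.length
    val scale = 1.0 / math.sqrt(k.toDouble)
    Array.tabulate(k) { j =>
      var acc = 0.0
      var i = 0
      while (i < row.length) { acc += row(i) * r(i)(j); i += 1 }
      acc * scale
    }
  }
}
```

Because two thirds of the entries are zero, the projection touches only about a third of each input row, which is the main appeal of this variant over a dense Gaussian matrix.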
[jira] [Assigned] (SPARK-11595) "ADD JAR" doesn't work if the given path contains URL scheme like "file:/" and "hdfs:/"
[ https://issues.apache.org/jira/browse/SPARK-11595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-11595:
Assignee: Apache Spark (was: Cheng Lian)
> "ADD JAR" doesn't work if the given path contains URL scheme like "file:/" and "hdfs:/"
> ---
>
> Key: SPARK-11595
> URL: https://issues.apache.org/jira/browse/SPARK-11595
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.5.2, 1.6.0
> Reporter: Cheng Lian
> Assignee: Apache Spark
> Priority: Blocker
>
> When handling {{ADD JAR}}, PR #8909 constructs a {{java.io.File}} first using the input jar path, and then converts it into a URL ([here|https://github.com/apache/spark/pull/8909/files#diff-d613c921507243c65591c003a348f5f3R180]).
> This works fine for a local file path without a URL scheme (e.g. {{/tmp/a.jar}}). However, {{java.io.File.toURI}} returns an unexpected result when given a path containing a URL scheme (e.g. {{file:///tmp/a.jar}} or {{hdfs://host:9000/path/to/a.jar}}):
> {noformat}
> scala> new java.io.File("file:///tmp/file").toURI
> res1: java.net.URI = file:/Users/lian/local/src/spark/workspace-a/file:/tmp/file
> {noformat}
> The consequence is that, although the {{ADD JAR}} command doesn't fail immediately, the jar is not actually added properly.
[jira] [Commented] (SPARK-11590) use native json_tuple in lateral view
[ https://issues.apache.org/jira/browse/SPARK-11590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14996490#comment-14996490 ]
Wenchen Fan commented on SPARK-11590:
-
https://github.com/apache/spark/pull/9562
> use native json_tuple in lateral view
> -
>
> Key: SPARK-11590
> URL: https://issues.apache.org/jira/browse/SPARK-11590
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Reporter: Wenchen Fan
[jira] [Issue Comment Deleted] (SPARK-11590) use native json_tuple in lateral view
[ https://issues.apache.org/jira/browse/SPARK-11590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wenchen Fan updated SPARK-11590:
Comment: was deleted
(was: https://github.com/apache/spark/pull/9562)
> use native json_tuple in lateral view
> -
>
> Key: SPARK-11590
> URL: https://issues.apache.org/jira/browse/SPARK-11590
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Reporter: Wenchen Fan
[jira] [Assigned] (SPARK-11595) "ADD JAR" doesn't work if the given path contains URL scheme like "file:/" and "hdfs:/"
[ https://issues.apache.org/jira/browse/SPARK-11595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-11595:
Assignee: Cheng Lian (was: Apache Spark)
> "ADD JAR" doesn't work if the given path contains URL scheme like "file:/" and "hdfs:/"
> ---
>
> Key: SPARK-11595
> URL: https://issues.apache.org/jira/browse/SPARK-11595
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.5.2, 1.6.0
> Reporter: Cheng Lian
> Assignee: Cheng Lian
> Priority: Blocker
>
> When handling {{ADD JAR}}, PR #8909 constructs a {{java.io.File}} first using the input jar path, and then converts it into a URL ([here|https://github.com/apache/spark/pull/8909/files#diff-d613c921507243c65591c003a348f5f3R180]).
> This works fine for a local file path without a URL scheme (e.g. {{/tmp/a.jar}}). However, {{java.io.File.toURI}} returns an unexpected result when given a path containing a URL scheme (e.g. {{file:///tmp/a.jar}} or {{hdfs://host:9000/path/to/a.jar}}):
> {noformat}
> scala> new java.io.File("file:///tmp/file").toURI
> res1: java.net.URI = file:/Users/lian/local/src/spark/workspace-a/file:/tmp/file
> {noformat}
> The consequence is that, although the {{ADD JAR}} command doesn't fail immediately, the jar is not actually added properly.
[jira] [Commented] (SPARK-11595) "ADD JAR" doesn't work if the given path contains URL scheme like "file:/" and "hdfs:/"
[ https://issues.apache.org/jira/browse/SPARK-11595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14996527#comment-14996527 ]
Apache Spark commented on SPARK-11595:
--
User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/9569
> "ADD JAR" doesn't work if the given path contains URL scheme like "file:/" and "hdfs:/"
> ---
>
> Key: SPARK-11595
> URL: https://issues.apache.org/jira/browse/SPARK-11595
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.5.2, 1.6.0
> Reporter: Cheng Lian
> Assignee: Cheng Lian
> Priority: Blocker
>
> When handling {{ADD JAR}}, PR #8909 constructs a {{java.io.File}} first using the input jar path, and then converts it into a URL ([here|https://github.com/apache/spark/pull/8909/files#diff-d613c921507243c65591c003a348f5f3R180]).
> This works fine for a local file path without a URL scheme (e.g. {{/tmp/a.jar}}). However, {{java.io.File.toURI}} returns an unexpected result when given a path containing a URL scheme (e.g. {{file:///tmp/a.jar}} or {{hdfs://host:9000/path/to/a.jar}}):
> {noformat}
> scala> new java.io.File("file:///tmp/file").toURI
> res1: java.net.URI = file:/Users/lian/local/src/spark/workspace-a/file:/tmp/file
> {noformat}
> The consequence is that, although the {{ADD JAR}} command doesn't fail immediately, the jar is not actually added properly.
[jira] [Updated] (SPARK-11530) Return eigenvalues with PCA model
[ https://issues.apache.org/jira/browse/SPARK-11530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Christos Iraklis Tsatsoulis updated SPARK-11530:
Component/s: MLlib
> Return eigenvalues with PCA model
> -
>
> Key: SPARK-11530
> URL: https://issues.apache.org/jira/browse/SPARK-11530
> Project: Spark
> Issue Type: Improvement
> Components: ML, MLlib
> Affects Versions: 1.5.1
> Reporter: Christos Iraklis Tsatsoulis
>
> For data scientists & statisticians, PCA is of little use if they cannot estimate the _proportion of variance explained_ by selecting _k_ principal components (see here for the math details: https://inst.eecs.berkeley.edu/~ee127a/book/login/l_sym_pca.html , section 'Explained variance'). To estimate this, one only needs the eigenvalues of the covariance matrix.
> Although the eigenvalues are currently computed during PCA model fitting, they are not _returned_; hence, as it stands now, PCA in Spark ML is of extremely limited practical use.
> For details, see these SO questions
> http://stackoverflow.com/questions/33428589/pyspark-and-pca-how-can-i-extract-the-eigenvectors-of-this-pca-how-can-i-calcu/ (pyspark)
> http://stackoverflow.com/questions/33559599/spark-pca-top-components (Scala)
> and this blog post http://www.nodalpoint.com/pca-in-spark-1-5/
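To make the request in SPARK-11530 concrete: once the eigenvalues of the covariance matrix are available, the proportion of variance explained by the top k components is the sum of the k largest eigenvalues divided by the sum of all of them. The following is a small sketch in plain Scala; the object and method names are hypothetical and nothing here is part of the Spark ML API.

```scala
// Hypothetical helper, not part of Spark ML.
object ExplainedVarianceSketch {

  /** Fraction of total variance captured by the k largest eigenvalues. */
  def explainedVariance(eigenvalues: Seq[Double], k: Int): Double = {
    require(k >= 0 && k <= eigenvalues.length, "k must be in [0, number of eigenvalues]")
    val sorted = eigenvalues.sortWith(_ > _) // largest first
    val total = sorted.sum
    require(total > 0.0, "eigenvalues must have a positive sum")
    sorted.take(k).sum / total
  }
}
```

For example, with eigenvalues 4, 3, 2 and 1, keeping k = 2 components gives (4 + 3) / 10, i.e. 70% of the variance.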
[jira] [Commented] (SPARK-10309) Some tasks failed with Unable to acquire memory
[ https://issues.apache.org/jira/browse/SPARK-10309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14996610#comment-14996610 ] Kristina Plazonic commented on SPARK-10309: --- Did anybody find a solution for this? I also get a lot of these errors (well into my job running, arghhh). > Some tasks failed with Unable to acquire memory > --- > > Key: SPARK-10309 > URL: https://issues.apache.org/jira/browse/SPARK-10309 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Davies Liu >Assignee: Davies Liu > > *=== Update ===* > This is caused by a mismatch between > `Runtime.getRuntime.availableProcessors()` and the number of active tasks in > `ShuffleMemoryManager`. A quick reproduction is the following: > {code} > // My machine only has 8 cores > $ bin/spark-shell --master local[32] > scala> val df = sc.parallelize(Seq((1, 1), (2, 2))).toDF("a", "b") > scala> df.as("x").join(df.as("y"), $"x.a" === $"y.a").count() > Caused by: java.io.IOException: Unable to acquire 2097152 bytes of memory > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:351) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.(UnsafeExternalSorter.java:138) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.create(UnsafeExternalSorter.java:106) > at > org.apache.spark.sql.execution.UnsafeExternalRowSorter.(UnsafeExternalRowSorter.java:68) > at > org.apache.spark.sql.execution.TungstenSort.org$apache$spark$sql$execution$TungstenSort$$preparePartition$1(sort.scala:120) > at > org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$2.apply(sort.scala:143) > at > org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$2.apply(sort.scala:143) > at > org.apache.spark.rdd.MapPartitionsWithPreparationRDD.prepare(MapPartitionsWithPreparationRDD.scala:50) > {code} > *=== Original ===* > While running Q53 of TPCDS (scale = 1500) on 24 
nodes cluster (12G memory on > executor): > {code} > java.io.IOException: Unable to acquire 33554432 bytes of memory > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:368) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.(UnsafeExternalSorter.java:138) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.create(UnsafeExternalSorter.java:106) > at > org.apache.spark.sql.execution.UnsafeExternalRowSorter.(UnsafeExternalRowSorter.java:68) > at > org.apache.spark.sql.execution.TungstenSort.org$apache$spark$sql$execution$TungstenSort$$preparePartition$1(sort.scala:146) > at > org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169) > at > org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169) > at > org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:45) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at > 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > The task could finish after a retry.
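The mismatch described in the SPARK-10309 update can be illustrated with some toy arithmetic. Everything in this sketch is an assumption made for illustration and is not Spark's actual code: it supposes that the page size is derived from the number of cores (pool / cores / 16, rounded up to a power of two), that a task may hold at most pool / activeTasks bytes, and that the shuffle pool is 160 MB. All names are hypothetical.

```scala
// Toy model only; constants and formulas are assumptions, not Spark internals.
object PageShareSketch {

  /** Smallest power of two that is >= n (n must be positive). */
  def nextPowerOf2(n: Long): Long = {
    val h = java.lang.Long.highestOneBit(n)
    if (h == n) n else h << 1
  }

  /** Assumed page size: derived from the core count, not from the task count. */
  def pageSize(poolBytes: Long, cores: Int, safetyFactor: Int = 16): Long =
    nextPowerOf2(poolBytes / cores / safetyFactor)

  /** Whole pages that fit in one task's assumed maximum share of the pool. */
  def maxPagesPerTask(poolBytes: Long, activeTasks: Int, page: Long): Long =
    (poolBytes / activeTasks) / page
}
```

Under these assumptions an 8-core machine yields a 2 MB page, which happens to match the 2097152 bytes in the reproduction above. With 8 active tasks each task could hold 10 such pages, but with `local[32]` the same page size leaves room for only 2 pages per task, so an operator that needs several pages at once could plausibly fail to acquire one.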
[jira] [Commented] (SPARK-11191) [1.5] Can't create UDF's using hive thrift service
[ https://issues.apache.org/jira/browse/SPARK-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14996519#comment-14996519 ] Cheng Lian commented on SPARK-11191: One of the problems here is SPARK-11595. However, after fixing SPARK-11595, {{CREATE TEMPORARY FUNCTION}} still doesn't work properly. Still investigating. > [1.5] Can't create UDF's using hive thrift service > -- > > Key: SPARK-11191 > URL: https://issues.apache.org/jira/browse/SPARK-11191 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.5.1 >Reporter: David Ross >Priority: Blocker > > Since upgrading to spark 1.5 we've been unable to create and use UDF's when > we run in thrift server mode. > Our setup: > We start the thrift-server running against yarn in client mode, (we've also > built our own spark from github branch-1.5 with the following args: {{-Pyarn > -Phive -Phive-thriftserver}}) > If I run the following after connecting via JDBC (in this case via beeline): > {{add jar 'hdfs://path/to/jar'}} > (this command succeeds with no errors) > {{CREATE TEMPORARY FUNCTION testUDF AS 'com.foo.class.UDF';}} > (this command succeeds with no errors) > {{select testUDF(col1) from table1;}} > I get the following error in the logs: > {code} > org.apache.spark.sql.AnalysisException: undefined function testUDF; line 1 > pos 8 > at > org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2$$anonfun$1.apply(hiveUDFs.scala:58) > at > org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2$$anonfun$1.apply(hiveUDFs.scala:58) > at scala.Option.getOrElse(Option.scala:120) > at > org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2.apply(hiveUDFs.scala:57) > at > org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2.apply(hiveUDFs.scala:53) > at scala.util.Try.getOrElse(Try.scala:77) > at > org.apache.spark.sql.hive.HiveFunctionRegistry.lookupFunction(hiveUDFs.scala:53) > at > 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5$$anonfun$applyOrElse$24.apply(Analyzer.scala:506) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5$$anonfun$applyOrElse$24.apply(Analyzer.scala:506) > at > org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:505) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:502) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:226) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249) > {code} > (cutting the bulk for ease of report, more than happy to send the full output) > {code} > 15/10/12 14:34:37 ERROR SparkExecuteStatementOperation: Error running hive > query: > org.apache.hive.service.cli.HiveSQLException: > org.apache.spark.sql.AnalysisException: undefined function testUDF; line 1 > pos 100 > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.runInternal(SparkExecuteStatementOperation.scala:259) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:171) > at java.security.AccessController.doPrivileged(Native Method) > 
at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:182) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at >
[jira] [Commented] (SPARK-10528) spark-shell throws java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable.
[ https://issues.apache.org/jira/browse/SPARK-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14996667#comment-14996667 ]
Michel Lemay commented on SPARK-10528:
--
Looks like the problem is still there with the precompiled 1.5.1 binaries.
I created the directory and ran chmod on it using winutils.exe as explained earlier:
from /tmp: winutils.exe ls hive
drwxrwxrwx
Furthermore, the hadoop LocalFileSystem does not seem to be able to change permissions, as shown here:
`import org.apache.hadoop.fs._
val path = new Path("file:/tmp/hive")
val lfs = FileSystem.get(path.toUri(), sc.hadoopConfiguration)
lfs.getFileStatus(path).getPermission()`
Shows: res0: org.apache.hadoop.fs.permission.FsPermission = rw-rw-rw-
`lfs.setPermission(path, new org.apache.hadoop.fs.permission.FsPermission(0777.toShort))
lfs.getFileStatus(new Path("file:/tmp/hive")).getPermission()`
Still shows rw-rw-rw-
> spark-shell throws java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable.
> --
>
> Key: SPARK-10528
> URL: https://issues.apache.org/jira/browse/SPARK-10528
> Project: Spark
> Issue Type: Bug
> Components: Spark Shell
> Affects Versions: 1.5.0
> Environment: Windows 7 x64
> Reporter: Aliaksei Belablotski
> Priority: Minor
>
> Starting spark-shell throws
> java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: rw-rw-rw-
[jira] [Resolved] (SPARK-11218) `./sbin/start-slave.sh --help` should print out the help message
[ https://issues.apache.org/jira/browse/SPARK-11218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen resolved SPARK-11218.
---
Resolution: Fixed
Fix Version/s: 1.6.0 1.7.0
Issue resolved by pull request 9432
[https://github.com/apache/spark/pull/9432]
> `./sbin/start-slave.sh --help` should print out the help message
> 
>
> Key: SPARK-11218
> URL: https://issues.apache.org/jira/browse/SPARK-11218
> Project: Spark
> Issue Type: Bug
> Components: Deploy
> Reporter: Jacek Laskowski
> Priority: Minor
> Fix For: 1.7.0, 1.6.0
>
> Reading the sources has shown that the command {{./sbin/start-slave.sh --help}} should print out the help message. In practice it does not:
> {code}
> ➜ spark git:(master) ✗ ./sbin/start-slave.sh --help
> starting org.apache.spark.deploy.worker.Worker, logging to /Users/jacek/dev/oss/spark/sbin/../logs/spark-jacek-org.apache.spark.deploy.worker.Worker-1-japila.local.out
> failed to launch org.apache.spark.deploy.worker.Worker:
> --properties-file FILE Path to a custom Spark properties file.
> Default is conf/spark-defaults.conf.
> full log in /Users/jacek/dev/oss/spark/sbin/../logs/spark-jacek-org.apache.spark.deploy.worker.Worker-1-japila.local.out
> {code}
[jira] [Created] (SPARK-11595) "ADD JAR" doesn't work if the given path contains URL scheme like "file:/" and "hdfs:/"
Cheng Lian created SPARK-11595:
--
Summary: "ADD JAR" doesn't work if the given path contains URL scheme like "file:/" and "hdfs:/"
Key: SPARK-11595
URL: https://issues.apache.org/jira/browse/SPARK-11595
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.5.2, 1.6.0
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Blocker

When handling {{ADD JAR}}, PR #8909 constructs a {{java.io.File}} first using the input jar path, and then converts it into a URL ([here|https://github.com/apache/spark/pull/8909/files#diff-d613c921507243c65591c003a348f5f3R180]).
This works fine for a local file path without a URL scheme (e.g. {{/tmp/a.jar}}). However, {{java.io.File.toURI}} returns an unexpected result when given a path containing a URL scheme (e.g. {{file:///tmp/a.jar}} or {{hdfs://host:9000/path/to/a.jar}}):
{noformat}
scala> new java.io.File("file:///tmp/file").toURI
res1: java.net.URI = file:/Users/lian/local/src/spark/workspace-a/file:/tmp/file
{noformat}
The consequence is that, although the {{ADD JAR}} command doesn't fail immediately, the jar is not actually added properly.
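One way to avoid the behaviour reported in SPARK-11595 is to check for a URI scheme before falling back to {{java.io.File}}. The sketch below only illustrates that idea with standard JDK classes; it is not the change made in the linked pull requests, and the object and method names are hypothetical.

```scala
import java.io.File
import java.net.{URI, URISyntaxException}

// Hypothetical helper, not taken from the Spark code base.
object JarPathSketch {

  /** Keeps paths that already carry a scheme (file:, hdfs:, ...) as URIs and
    * treats everything else as a local file path. */
  def toJarUri(path: String): URI = {
    val parsed =
      try Some(new URI(path))
      catch { case _: URISyntaxException => None } // e.g. paths with spaces or backslashes
    parsed.filter(_.getScheme != null).getOrElse(new File(path).toURI)
  }
}
```

With this helper {{file:///tmp/a.jar}} and {{hdfs://host:9000/path/to/a.jar}} keep their scheme, while {{/tmp/a.jar}} goes through {{File.toURI}} as before. One caveat of this simple check: a Windows path such as {{C:/tmp/a.jar}} parses as a URI with scheme {{C}}, so real code would need an extra rule for drive letters.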
[jira] [Assigned] (SPARK-11594) Cannot create UDAF in REPL
[ https://issues.apache.org/jira/browse/SPARK-11594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11594: Assignee: (was: Apache Spark) > Cannot create UDAF in REPL > -- > > Key: SPARK-11594 > URL: https://issues.apache.org/jira/browse/SPARK-11594 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1, 1.6.0 > Environment: Latest Spark Master > JVM 1.8.0_66-b17 >Reporter: Herman van Hovell >Priority: Minor > > If you try to define a UDAF in the REPL, an internal error is thrown by > Java. The following code for example: > {noformat} > import org.apache.spark.sql.Row > import org.apache.spark.sql.types.{DataType, LongType, StructType} > import org.apache.spark.sql.expressions.{MutableAggregationBuffer, > UserDefinedAggregateFunction} > class LongProductSum extends UserDefinedAggregateFunction { > def inputSchema: StructType = new StructType() > .add("a", LongType) > .add("b", LongType) > def bufferSchema: StructType = new StructType() > .add("product", LongType) > def dataType: DataType = LongType > def deterministic: Boolean = true > def initialize(buffer: MutableAggregationBuffer): Unit = { > buffer(0) = 0L > } > def update(buffer: MutableAggregationBuffer, input: Row): Unit = { > if (!(input.isNullAt(0) || input.isNullAt(1))) { > buffer(0) = buffer.getLong(0) + input.getLong(0) * input.getLong(1) > } > } > def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = { > buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0) > } > def evaluate(buffer: Row): Any = > buffer.getLong(0) > } > sqlContext.udf.register("longProductSum", new LongProductSum) > val data2 = Seq[(Integer, Integer, Integer)]( > (1, 10, -10), > (null, -60, 60), > (1, 30, -30), > (1, 30, 30), > (2, 1, 1), > (3, null, null)).toDF("key", "value1", "value2") > data2.registerTempTable("agg2") > val q = sqlContext.sql(""" > |SELECT > | key, > | count(distinct value1, value2), > | longProductSum(distinct value1, value2) 
> |FROM agg2 > |GROUP BY key > """.stripMargin) > q.show > {noformat} > Will throw the following error: > {noformat} > java.lang.InternalError: Malformed class name > at java.lang.Class.getSimpleName(Class.java:1330) > at > org.apache.spark.sql.execution.aggregate.ScalaUDAF.toString(udaf.scala:455) > at > org.apache.spark.sql.execution.SparkStrategies$Aggregation$$anonfun$9.apply(SparkStrategies.scala:211) > at > org.apache.spark.sql.execution.SparkStrategies$Aggregation$$anonfun$9.apply(SparkStrategies.scala:209) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > org.apache.spark.sql.execution.SparkStrategies$Aggregation$.apply(SparkStrategies.scala:209) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54) > at > org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:445) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59) 
> at > org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:51) > at > org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:49) > at > org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:56) > at > org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:56) > at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2092) > at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1419) > at
[jira] [Assigned] (SPARK-11594) Cannot create UDAF in REPL
[ https://issues.apache.org/jira/browse/SPARK-11594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11594: Assignee: Apache Spark > Cannot create UDAF in REPL > -- > > Key: SPARK-11594 > URL: https://issues.apache.org/jira/browse/SPARK-11594 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1, 1.6.0 > Environment: Latest Spark Master > JVM 1.8.0_66-b17 >Reporter: Herman van Hovell >Assignee: Apache Spark >Priority: Minor > > If you try to define a UDAF in the REPL, an internal error is thrown by > Java. The following code for example: > {noformat} > import org.apache.spark.sql.Row > import org.apache.spark.sql.types.{DataType, LongType, StructType} > import org.apache.spark.sql.expressions.{MutableAggregationBuffer, > UserDefinedAggregateFunction} > class LongProductSum extends UserDefinedAggregateFunction { > def inputSchema: StructType = new StructType() > .add("a", LongType) > .add("b", LongType) > def bufferSchema: StructType = new StructType() > .add("product", LongType) > def dataType: DataType = LongType > def deterministic: Boolean = true > def initialize(buffer: MutableAggregationBuffer): Unit = { > buffer(0) = 0L > } > def update(buffer: MutableAggregationBuffer, input: Row): Unit = { > if (!(input.isNullAt(0) || input.isNullAt(1))) { > buffer(0) = buffer.getLong(0) + input.getLong(0) * input.getLong(1) > } > } > def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = { > buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0) > } > def evaluate(buffer: Row): Any = > buffer.getLong(0) > } > sqlContext.udf.register("longProductSum", new LongProductSum) > val data2 = Seq[(Integer, Integer, Integer)]( > (1, 10, -10), > (null, -60, 60), > (1, 30, -30), > (1, 30, 30), > (2, 1, 1), > (3, null, null)).toDF("key", "value1", "value2") > data2.registerTempTable("agg2") > val q = sqlContext.sql(""" > |SELECT > | key, > | count(distinct value1, value2), > | longProductSum(distinct 
value1, value2) > |FROM agg2 > |GROUP BY key > """.stripMargin) > q.show > {noformat} > Will throw the following error: > {noformat} > java.lang.InternalError: Malformed class name > at java.lang.Class.getSimpleName(Class.java:1330) > at > org.apache.spark.sql.execution.aggregate.ScalaUDAF.toString(udaf.scala:455) > at > org.apache.spark.sql.execution.SparkStrategies$Aggregation$$anonfun$9.apply(SparkStrategies.scala:211) > at > org.apache.spark.sql.execution.SparkStrategies$Aggregation$$anonfun$9.apply(SparkStrategies.scala:209) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > org.apache.spark.sql.execution.SparkStrategies$Aggregation$.apply(SparkStrategies.scala:209) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54) > at > org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:445) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at > 
org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59) > at > org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:51) > at > org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:49) > at > org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:56) > at > org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:56) > at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2092) > at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1419) >
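The trace above fails inside {{java.lang.Class.getSimpleName}}, called from {{ScalaUDAF.toString}}. A minimal sketch of one defensive pattern follows. It is not necessarily what the pull request for this issue does, and {{SafeNameSketch}} and {{safeSimpleName}} are hypothetical names used only for illustration. Note that {{scala.util.Try}} would not help here, because {{InternalError}} is a {{VirtualMachineError}} and is treated as fatal by {{NonFatal}}, so an explicit {{catch}} is used.

```scala
object SafeNameSketch {
  // Hypothetical helper: returns the simple class name when the JVM can
  // compute it, and falls back to the fully qualified name otherwise.
  def safeSimpleName(cls: Class[_]): String =
    try {
      cls.getSimpleName
    } catch {
      // Some JVMs (the report above uses JVM 1.8.0_66) throw
      // "InternalError: Malformed class name" for certain nested or
      // REPL-generated classes.
      case _: InternalError => cls.getName
    }
}
```

A {{toString}} built on such a helper would degrade to a longer name instead of aborting query planning.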
[jira] [Commented] (SPARK-11594) Cannot create UDAF in REPL
[ https://issues.apache.org/jira/browse/SPARK-11594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14996518#comment-14996518 ] Apache Spark commented on SPARK-11594: -- User 'hvanhovell' has created a pull request for this issue: https://github.com/apache/spark/pull/9568 > Cannot create UDAF in REPL > -- > > Key: SPARK-11594 > URL: https://issues.apache.org/jira/browse/SPARK-11594 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1, 1.6.0 > Environment: Latest Spark Master > JVM 1.8.0_66-b17 >Reporter: Herman van Hovell >Priority: Minor > > If you try to define a UDAF in the REPL, an internal error is thrown by > Java. The following code, for example: > {noformat} > import org.apache.spark.sql.Row > import org.apache.spark.sql.types.{DataType, LongType, StructType} > import org.apache.spark.sql.expressions.{MutableAggregationBuffer, > UserDefinedAggregateFunction} > class LongProductSum extends UserDefinedAggregateFunction { > def inputSchema: StructType = new StructType() > .add("a", LongType) > .add("b", LongType) > def bufferSchema: StructType = new StructType() > .add("product", LongType) > def dataType: DataType = LongType > def deterministic: Boolean = true > def initialize(buffer: MutableAggregationBuffer): Unit = { > buffer(0) = 0L > } > def update(buffer: MutableAggregationBuffer, input: Row): Unit = { > if (!(input.isNullAt(0) || input.isNullAt(1))) { > buffer(0) = buffer.getLong(0) + input.getLong(0) * input.getLong(1) > } > } > def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = { > buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0) > } > def evaluate(buffer: Row): Any = > buffer.getLong(0) > } > sqlContext.udf.register("longProductSum", new LongProductSum) > val data2 = Seq[(Integer, Integer, Integer)]( > (1, 10, -10), > (null, -60, 60), > (1, 30, -30), > (1, 30, 30), > (2, 1, 1), > (3, null, null)).toDF("key", "value1", "value2") > data2.registerTempTable("agg2") > val q = 
sqlContext.sql(""" > |SELECT > | key, > | count(distinct value1, value2), > | longProductSum(distinct value1, value2) > |FROM agg2 > |GROUP BY key > """.stripMargin) > q.show > {noformat} > Will throw the following error: > {noformat} > java.lang.InternalError: Malformed class name > at java.lang.Class.getSimpleName(Class.java:1330) > at > org.apache.spark.sql.execution.aggregate.ScalaUDAF.toString(udaf.scala:455) > at > org.apache.spark.sql.execution.SparkStrategies$Aggregation$$anonfun$9.apply(SparkStrategies.scala:211) > at > org.apache.spark.sql.execution.SparkStrategies$Aggregation$$anonfun$9.apply(SparkStrategies.scala:209) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > org.apache.spark.sql.execution.SparkStrategies$Aggregation$.apply(SparkStrategies.scala:209) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54) > at > org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:445) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at 
scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59) > at > org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:51) > at > org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:49) > at > org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:56) > at > org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:56) > at
[jira] [Comment Edited] (SPARK-11530) Return eigenvalues with PCA model
[ https://issues.apache.org/jira/browse/SPARK-11530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14996533#comment-14996533 ] Christos Iraklis Tsatsoulis edited comment on SPARK-11530 at 11/9/15 1:40 PM: -- I edited it to target both; there are `PCA.scala` scripts for both ML & MLLib, but since I am using it via PySpark, where it is available only via ML, I initially omitted MLlib was (Author: ctsats): I edited it to target both; there are PCA.scala scripts for both ML & MLLib, but since I am using it via PySpark, where it is available only via ML, I initially omitted MLlib > Return eigenvalues with PCA model > - > > Key: SPARK-11530 > URL: https://issues.apache.org/jira/browse/SPARK-11530 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 1.5.1 >Reporter: Christos Iraklis Tsatsoulis > > For data scientists & statisticians, PCA is of little use if they cannot > estimate the _proportion of variance explained_ by selecting _k_ principal > components (see here for the math details: > https://inst.eecs.berkeley.edu/~ee127a/book/login/l_sym_pca.html , section > 'Explained variance'). To estimate this, one only needs the eigenvalues of > the covariance matrix. > Although the eigenvalues are currently computed during PCA model fitting, > they are not _returned_; hence, as it stands now, PCA in Spark ML is of > extremely limited practical use. > For details, see these SO questions > http://stackoverflow.com/questions/33428589/pyspark-and-pca-how-can-i-extract-the-eigenvectors-of-this-pca-how-can-i-calcu/ > (pyspark) > http://stackoverflow.com/questions/33559599/spark-pca-top-components (Scala) > and this blog post http://www.nodalpoint.com/pca-in-spark-1-5/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
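The description above notes that the proportion of variance explained needs only the eigenvalues of the covariance matrix. A minimal sketch of that arithmetic in plain Scala follows. It assumes the eigenvalues have already been obtained by some other means, and {{ExplainedVarianceSketch}} and its methods are hypothetical names, not part of Spark ML or MLlib.

```scala
object ExplainedVarianceSketch {
  // Share of the total variance carried by each eigenvalue,
  // in the same order as the input.
  def explainedVariance(eigenvalues: Seq[Double]): Seq[Double] = {
    val total = eigenvalues.sum
    require(total > 0.0, "eigenvalues must have a positive sum")
    eigenvalues.map(_ / total)
  }

  // Cumulative share of variance explained by the k largest eigenvalues.
  def explainedByTopK(eigenvalues: Seq[Double], k: Int): Double = {
    val descending = eigenvalues.sorted.reverse
    explainedVariance(descending).take(k).sum
  }
}
```

For eigenvalues 4, 3, 2 and 1, the two largest account for (4 + 3) / 10 of the total variance, which is the quantity a user would inspect when choosing _k_.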
[jira] [Updated] (SPARK-11218) `./sbin/start-slave.sh --help` should print out the help message
[ https://issues.apache.org/jira/browse/SPARK-11218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-11218: -- Assignee: Charles Yeh > `./sbin/start-slave.sh --help` should print out the help message > > > Key: SPARK-11218 > URL: https://issues.apache.org/jira/browse/SPARK-11218 > Project: Spark > Issue Type: Bug > Components: Deploy >Reporter: Jacek Laskowski >Assignee: Charles Yeh >Priority: Minor > Fix For: 1.6.0, 1.7.0 > > > Reading the sources has shown that the command {{./sbin/start-slave.sh > --help}} should print out the help message. In practice it does not. > {code} > ➜ spark git:(master) ✗ ./sbin/start-slave.sh --help > starting org.apache.spark.deploy.worker.Worker, logging to > /Users/jacek/dev/oss/spark/sbin/../logs/spark-jacek-org.apache.spark.deploy.worker.Worker-1-japila.local.out > failed to launch org.apache.spark.deploy.worker.Worker: > --properties-file FILE Path to a custom Spark properties file. > Default is conf/spark-defaults.conf. > full log in > /Users/jacek/dev/oss/spark/sbin/../logs/spark-jacek-org.apache.spark.deploy.worker.Worker-1-japila.local.out > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-11530) Return eigenvalues with PCA model
[ https://issues.apache.org/jira/browse/SPARK-11530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14996533#comment-14996533 ] Christos Iraklis Tsatsoulis edited comment on SPARK-11530 at 11/9/15 1:45 PM: -- I edited it to target both; there are PCA.scala scripts for both ML & MLLib, but since I am using it via PySpark, where it is available only via ML, I initially omitted MLlib was (Author: ctsats): I edited it to target both; there are `PCA.scala` scripts for both ML & MLLib, but since I am using it via PySpark, where it is available only via ML, I initially omitted MLlib > Return eigenvalues with PCA model > - > > Key: SPARK-11530 > URL: https://issues.apache.org/jira/browse/SPARK-11530 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 1.5.1 >Reporter: Christos Iraklis Tsatsoulis > > For data scientists & statisticians, PCA is of little use if they cannot > estimate the _proportion of variance explained_ by selecting _k_ principal > components (see here for the math details: > https://inst.eecs.berkeley.edu/~ee127a/book/login/l_sym_pca.html , section > 'Explained variance'). To estimate this, one only needs the eigenvalues of > the covariance matrix. > Although the eigenvalues are currently computed during PCA model fitting, > they are not _returned_; hence, as it stands now, PCA in Spark ML is of > extremely limited practical use. > For details, see these SO questions > http://stackoverflow.com/questions/33428589/pyspark-and-pca-how-can-i-extract-the-eigenvectors-of-this-pca-how-can-i-calcu/ > (pyspark) > http://stackoverflow.com/questions/33559599/spark-pca-top-components (Scala) > and this blog post http://www.nodalpoint.com/pca-in-spark-1-5/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-11530) Return eigenvalues with PCA model
[ https://issues.apache.org/jira/browse/SPARK-11530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14996533#comment-14996533 ] Christos Iraklis Tsatsoulis edited comment on SPARK-11530 at 11/9/15 1:50 PM: -- I edited it to target both; there are PCA.scala scripts for both ML & MLlib, but since I am using it via PySpark, where it is available only via ML, I initially omitted MLlib. was (Author: ctsats): I edited it to target both; there are PCA.scala scripts for both ML & MLLib, but since I am using it via PySpark, where it is available only via ML, I initially omitted MLlib > Return eigenvalues with PCA model > - > > Key: SPARK-11530 > URL: https://issues.apache.org/jira/browse/SPARK-11530 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 1.5.1 >Reporter: Christos Iraklis Tsatsoulis > > For data scientists & statisticians, PCA is of little use if they cannot > estimate the _proportion of variance explained_ by selecting _k_ principal > components (see here for the math details: > https://inst.eecs.berkeley.edu/~ee127a/book/login/l_sym_pca.html , section > 'Explained variance'). To estimate this, one only needs the eigenvalues of > the covariance matrix. > Although the eigenvalues are currently computed during PCA model fitting, > they are not _returned_; hence, as it stands now, PCA in Spark ML is of > extremely limited practical use. > For details, see these SO questions > http://stackoverflow.com/questions/33428589/pyspark-and-pca-how-can-i-extract-the-eigenvectors-of-this-pca-how-can-i-calcu/ > (pyspark) > http://stackoverflow.com/questions/33559599/spark-pca-top-components (Scala) > and this blog post http://www.nodalpoint.com/pca-in-spark-1-5/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11595) "ADD JAR" doesn't work if the given path contains URL scheme like "file:/" and "hdfs:/"
[ https://issues.apache.org/jira/browse/SPARK-11595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-11595: --- Description: When handling {{ADD JAR}}, Spark constructs a {{java.io.File}} first using the input jar path, and then converts it into a URL ([here|https://github.com/apache/spark/pull/8909/files#diff-d613c921507243c65591c003a348f5f3R180]). This works fine for a local file path without a URL scheme (e.g. {{/tmp/a.jar}}). However, {{java.io.File.toURI}} returns an unexpected result when given a path containing a URL scheme (e.g. {{file:///tmp/a.jar}} or {{hdfs://host:9000/path/to/a.jar}}): {noformat} scala> new java.io.File("file:///tmp/file").toURI res1: java.net.URI = file:/Users/lian/local/src/spark/workspace-a/file:/tmp/file {noformat} The consequence is that, although the {{ADD JAR}} command doesn't fail immediately, the jar is actually not added properly. was: When handling {{ADD JAR}}, PR #8909 constructs a {{java.io.File}} first using the input jar path, and then converts it into a URL ([here|https://github.com/apache/spark/pull/8909/files#diff-d613c921507243c65591c003a348f5f3R180]). This works fine for a local file path without a URL scheme (e.g. {{/tmp/a.jar}}). However, {{java.io.File.toURI}} returns an unexpected result when given a path containing a URL scheme (e.g. {{file:///tmp/a.jar}} or {{hdfs://host:9000/path/to/a.jar}}): {noformat} scala> new java.io.File("file:///tmp/file").toURI res1: java.net.URI = file:/Users/lian/local/src/spark/workspace-a/file:/tmp/file {noformat} The consequence is that, although the {{ADD JAR}} command doesn't fail immediately, the jar is actually not added properly. 
> "ADD JAR" doesn't work if the given path contains URL scheme like "file:/" > and "hdfs:/" > --- > > Key: SPARK-11595 > URL: https://issues.apache.org/jira/browse/SPARK-11595 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2, 1.6.0 >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Blocker > > When handling {{ADD JAR}}, Spark constructs a {{java.io.File}} first using > the input jar path, and then converts it into a URL > ([here|https://github.com/apache/spark/pull/8909/files#diff-d613c921507243c65591c003a348f5f3R180]). > This works fine for a local file path without a URL scheme (e.g. > {{/tmp/a.jar}}). However, {{java.io.File.toURI}} returns an unexpected result > when given a path containing a URL scheme (e.g. {{file:///tmp/a.jar}} or > {{hdfs://host:9000/path/to/a.jar}}): > {noformat} > scala> new java.io.File("file:///tmp/file").toURI > res1: java.net.URI = > file:/Users/lian/local/src/spark/workspace-a/file:/tmp/file > {noformat} > The consequence is that, although the {{ADD JAR}} command doesn't fail > immediately, the jar is actually not added properly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
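The behaviour described above comes from treating every input as a plain file path. A minimal sketch of one way to tell the two cases apart follows. It is an illustration under stated assumptions, not the change made in the linked pull request, and {{JarPathSketch}} and {{toJarUri}} are hypothetical names. The sketch parses the input as a {{java.net.URI}} first and only falls back to {{java.io.File}} when no scheme is present or the string is not a valid URI.

```scala
import java.io.File
import java.net.URI
import scala.util.Try

object JarPathSketch {
  // Hypothetical helper: keeps inputs that already carry a URL scheme
  // (file:, hdfs:, ...) as they are, and resolves scheme-less paths
  // through java.io.File, which is the case the original code handled.
  def toJarUri(path: String): URI =
    Try(new URI(path)).toOption
      .filter(_.getScheme != null)
      .getOrElse(new File(path).getAbsoluteFile.toURI)
}
```

With this approach {{file:///tmp/a.jar}} keeps the path {{/tmp/a.jar}} instead of being appended to the working directory. One known limitation of the sketch is that a Windows-style path such as {{C:/tmp/a.jar}} would parse with scheme {{C}}, so real code would need an extra check for that case.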
[jira] [Commented] (SPARK-11530) Return eigenvalues with PCA model
[ https://issues.apache.org/jira/browse/SPARK-11530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14996533#comment-14996533 ] Christos Iraklis Tsatsoulis commented on SPARK-11530: - I edited it to target both; there are ``PCA.scala`` scripts for both ML & MLLib, but since I am using it via PySpark, where it is available only via ML, I initially omitted MLlib > Return eigenvalues with PCA model > - > > Key: SPARK-11530 > URL: https://issues.apache.org/jira/browse/SPARK-11530 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 1.5.1 >Reporter: Christos Iraklis Tsatsoulis > > For data scientists & statisticians, PCA is of little use if they cannot > estimate the _proportion of variance explained_ by selecting _k_ principal > components (see here for the math details: > https://inst.eecs.berkeley.edu/~ee127a/book/login/l_sym_pca.html , section > 'Explained variance'). To estimate this, one only needs the eigenvalues of > the covariance matrix. > Although the eigenvalues are currently computed during PCA model fitting, > they are not _returned_; hence, as it stands now, PCA in Spark ML is of > extremely limited practical use. > For details, see these SO questions > http://stackoverflow.com/questions/33428589/pyspark-and-pca-how-can-i-extract-the-eigenvectors-of-this-pca-how-can-i-calcu/ > (pyspark) > http://stackoverflow.com/questions/33559599/spark-pca-top-components (Scala) > and this blog post http://www.nodalpoint.com/pca-in-spark-1-5/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-11530) Return eigenvalues with PCA model
[ https://issues.apache.org/jira/browse/SPARK-11530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14996533#comment-14996533 ] Christos Iraklis Tsatsoulis edited comment on SPARK-11530 at 11/9/15 1:37 PM: -- I edited it to target both; there are PCA.scala scripts for both ML & MLLib, but since I am using it via PySpark, where it is available only via ML, I initially omitted MLlib was (Author: ctsats): I edited it to target both; there are ``PCA.scala`` scripts for both ML & MLLib, but since I am using it via PySpark, where it is available only via ML, I initially omitted MLlib > Return eigenvalues with PCA model > - > > Key: SPARK-11530 > URL: https://issues.apache.org/jira/browse/SPARK-11530 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 1.5.1 >Reporter: Christos Iraklis Tsatsoulis > > For data scientists & statisticians, PCA is of little use if they cannot > estimate the _proportion of variance explained_ by selecting _k_ principal > components (see here for the math details: > https://inst.eecs.berkeley.edu/~ee127a/book/login/l_sym_pca.html , section > 'Explained variance'). To estimate this, one only needs the eigenvalues of > the covariance matrix. > Although the eigenvalues are currently computed during PCA model fitting, > they are not _returned_; hence, as it stands now, PCA in Spark ML is of > extremely limited practical use. > For details, see these SO questions > http://stackoverflow.com/questions/33428589/pyspark-and-pca-how-can-i-extract-the-eigenvectors-of-this-pca-how-can-i-calcu/ > (pyspark) > http://stackoverflow.com/questions/33559599/spark-pca-top-components (Scala) > and this blog post http://www.nodalpoint.com/pca-in-spark-1-5/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10280) Add @since annotation to pyspark.ml.classification
[ https://issues.apache.org/jira/browse/SPARK-10280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-10280. --- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 8690 [https://github.com/apache/spark/pull/8690] > Add @since annotation to pyspark.ml.classification > -- > > Key: SPARK-10280 > URL: https://issues.apache.org/jira/browse/SPARK-10280 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, PySpark >Reporter: Xiangrui Meng >Assignee: Yu Ishikawa >Priority: Minor > Labels: starter > Fix For: 1.6.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9297) covar_pop and covar_samp aggregate functions
[ https://issues.apache.org/jira/browse/SPARK-9297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-9297: Target Version/s: (was: 1.6.0) > covar_pop and covar_samp aggregate functions > > > Key: SPARK-9297 > URL: https://issues.apache.org/jira/browse/SPARK-9297 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai > > A short introduction on how to build aggregate functions based on our new > interface can be found at > https://issues.apache.org/jira/browse/SPARK-4366?focusedCommentId=14639921=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14639921. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
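For reference, the two quantities named in the issue title can be written out directly. The sketch below is plain Scala over in-memory sequences and does not use Spark's aggregate function interface mentioned above. {{CovarianceSketch}}, {{covarPop}} and {{covarSamp}} are hypothetical names, and the usual definitions are assumed: the sum of products of deviations from the means, divided by n for the population covariance and by n - 1 for the sample covariance.

```scala
object CovarianceSketch {
  // Sum over i of (x_i - mean(x)) * (y_i - mean(y)).
  private def coMoment(xs: Seq[Double], ys: Seq[Double]): Double = {
    require(xs.nonEmpty && xs.length == ys.length,
      "inputs must be non-empty and of equal length")
    val meanX = xs.sum / xs.length
    val meanY = ys.sum / ys.length
    xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
  }

  // Population covariance: co-moment divided by n.
  def covarPop(xs: Seq[Double], ys: Seq[Double]): Double =
    coMoment(xs, ys) / xs.length

  // Sample covariance: co-moment divided by n - 1.
  def covarSamp(xs: Seq[Double], ys: Seq[Double]): Double = {
    require(xs.length > 1, "sample covariance needs at least two points")
    coMoment(xs, ys) / (xs.length - 1)
  }
}
```

A distributed implementation would typically keep running counts, means and the co-moment in an aggregation buffer and merge partial buffers, rather than materialising the data as done here.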
[jira] [Assigned] (SPARK-7841) Spark build should not use lib_managed for dependencies
[ https://issues.apache.org/jira/browse/SPARK-7841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7841: --- Assignee: Apache Spark > Spark build should not use lib_managed for dependencies > --- > > Key: SPARK-7841 > URL: https://issues.apache.org/jira/browse/SPARK-7841 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.3.1 >Reporter: Iulian Dragos >Assignee: Apache Spark > Labels: easyfix, sbt > > - unnecessary duplication (I will have those libraries under ./m2, via maven > anyway) > - every time I call make-distribution I lose lib_managed (via mvn clean > install) and have to wait to download again all jars next time I use sbt > - Eclipse does not handle relative paths very well (source attachments from > lib_managed don’t always work) > - it's not the default configuration. If we stray from defaults I think there > should be a clear advantage. > Digging through history, the only reference to `retrieveManaged := true` I > found was in f686e3d, from July 2011 ("Initial work on converting build to > SBT 0.10.1"). My guess is that this is purely an accident of porting the build from > Sbt 0.7.x and trying to keep the old project layout. > If there are reasons for keeping it, please comment (I didn't get any answers > on the [dev mailing > list|http://apache-spark-developers-list.1001551.n3.nabble.com/Why-use-quot-lib-managed-quot-for-the-Sbt-build-td12361.html]) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7841) Spark build should not use lib_managed for dependencies
[ https://issues.apache.org/jira/browse/SPARK-7841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7841: --- Assignee: (was: Apache Spark) > Spark build should not use lib_managed for dependencies > --- > > Key: SPARK-7841 > URL: https://issues.apache.org/jira/browse/SPARK-7841 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.3.1 >Reporter: Iulian Dragos > Labels: easyfix, sbt > > - unnecessary duplication (I will have those libraries under ./m2, via maven > anyway) > - every time I call make-distribution I lose lib_managed (via mvn clean > install) and have to wait to download again all jars next time I use sbt > - Eclipse does not handle relative paths very well (source attachments from > lib_managed don’t always work) > - it's not the default configuration. If we stray from defaults I think there > should be a clear advantage. > Digging through history, the only reference to `retrieveManaged := true` I > found was in f686e3d, from July 2011 ("Initial work on converting build to > SBT 0.10.1"). My guess is that this is purely an accident of porting the build from > Sbt 0.7.x and trying to keep the old project layout. > If there are reasons for keeping it, please comment (I didn't get any answers > on the [dev mailing > list|http://apache-spark-developers-list.1001551.n3.nabble.com/Why-use-quot-lib-managed-quot-for-the-Sbt-build-td12361.html]) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7841) Spark build should not use lib_managed for dependencies
[ https://issues.apache.org/jira/browse/SPARK-7841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997489#comment-14997489 ] Apache Spark commented on SPARK-7841: - User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/9575 > Spark build should not use lib_managed for dependencies > --- > > Key: SPARK-7841 > URL: https://issues.apache.org/jira/browse/SPARK-7841 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.3.1 >Reporter: Iulian Dragos > Labels: easyfix, sbt > > - unnecessary duplication (I will have those libraries under ./m2, via maven > anyway) > - every time I call make-distribution I lose lib_managed (via mvn clean > install) and have to wait to download again all jars next time I use sbt > - Eclipse does not handle relative paths very well (source attachments from > lib_managed don’t always work) > - it's not the default configuration. If we stray from defaults I think there > should be a clear advantage. > Digging through history, the only reference to `retrieveManaged := true` I > found was in f686e3d, from July 2011 ("Initial work on converting build to > SBT 0.10.1"). My guess is that this is purely an accident of porting the build from > Sbt 0.7.x and trying to keep the old project layout. > If there are reasons for keeping it, please comment (I didn't get any answers > on the [dev mailing > list|http://apache-spark-developers-list.1001551.n3.nabble.com/Why-use-quot-lib-managed-quot-for-the-Sbt-build-td12361.html]) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11599) NPE when resolve a non-existent function
[ https://issues.apache.org/jira/browse/SPARK-11599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11599: Assignee: (was: Apache Spark) > NPE when resolve a non-existent function > > > Key: SPARK-11599 > URL: https://issues.apache.org/jira/browse/SPARK-11599 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.0 >Reporter: Davies Liu > > {code} > java.lang.NullPointerException > at > org.apache.hadoop.hive.ql.exec.Registry.getFunctionInfo(Registry.java:254) > at > org.apache.hadoop.hive.ql.exec.FunctionRegistry.getFunctionInfo(FunctionRegistry.java:466) > at > org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2.apply(hiveUDFs.scala:59) > at > org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2.apply(hiveUDFs.scala:55) > at scala.util.Try.getOrElse(Try.scala:77) > at > org.apache.spark.sql.hive.HiveFunctionRegistry.lookupFunction(hiveUDFs.scala:55) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$11$$anonfun$applyOrElse$5$$anonfun$applyOrElse$21.apply(Analyzer.scala:527) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$11$$anonfun$applyOrElse$5$$anonfun$applyOrElse$21.apply(Analyzer.scala:527) > at > org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$11$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:526) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$11$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:523) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:228) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:228) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:227) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:233) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:233) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:250) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) > at scala.collection.AbstractIterator.to(Iterator.scala:1157) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:280) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:233) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionDown$1(QueryPlan.scala:75) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:85) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:89) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) 
> at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:89) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:93) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) >
[jira] [Assigned] (SPARK-11599) NPE when resolve a non-existent function
[ https://issues.apache.org/jira/browse/SPARK-11599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-11599:
------------------------------------

    Assignee: Apache Spark

> NPE when resolve a non-existent function
> ----------------------------------------
>
> Key: SPARK-11599
> URL: https://issues.apache.org/jira/browse/SPARK-11599
> Project: Spark
> Issue Type: Bug
> Affects Versions: 1.6.0
> Reporter: Davies Liu
> Assignee: Apache Spark
>
> {code}
> java.lang.NullPointerException
>     at org.apache.hadoop.hive.ql.exec.Registry.getFunctionInfo(Registry.java:254)
>     at org.apache.hadoop.hive.ql.exec.FunctionRegistry.getFunctionInfo(FunctionRegistry.java:466)
>     at org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2.apply(hiveUDFs.scala:59)
>     at org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2.apply(hiveUDFs.scala:55)
>     at scala.util.Try.getOrElse(Try.scala:77)
>     at org.apache.spark.sql.hive.HiveFunctionRegistry.lookupFunction(hiveUDFs.scala:55)
>     at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$11$$anonfun$applyOrElse$5$$anonfun$applyOrElse$21.apply(Analyzer.scala:527)
>     at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$11$$anonfun$applyOrElse$5$$anonfun$applyOrElse$21.apply(Analyzer.scala:527)
>     at org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
>     at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$11$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:526)
>     at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$11$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:523)
>     at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:228)
>     at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:228)
>     at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
>     at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:227)
>     at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:233)
>     at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:233)
>     at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:250)
>     at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>     at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>     at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>     at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>     at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>     at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>     at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>     at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>     at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>     at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>     at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>     at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>     at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:280)
>     at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:233)
>     at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionDown$1(QueryPlan.scala:75)
>     at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:85)
>     at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:89)
>     at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>     at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>     at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>     at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>     at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>     at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>     at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:89)
>     at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:93)
>     at
[jira] [Updated] (SPARK-11574) Spark should support StatsD sink out of box
[ https://issues.apache.org/jira/browse/SPARK-11574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiaofeng Lin updated SPARK-11574:
---------------------------------

    Affects Version/s:     (was: 1.5.2)
                       1.5.1

> Spark should support StatsD sink out of box
> -------------------------------------------
>
> Key: SPARK-11574
> URL: https://issues.apache.org/jira/browse/SPARK-11574
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 1.5.1
> Reporter: Xiaofeng Lin
>
> In order to run spark in production, monitoring is essential. StatsD is such
> a common metric reporting mechanism that it should be supported out of the
> box. This will enable publishing metrics to monitoring services like
> datadog, etc.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9301) collect_set and collect_list aggregate functions
[ https://issues.apache.org/jira/browse/SPARK-9301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai updated SPARK-9301:
----------------------------

    Target Version/s:   (was: 1.6.0)

> collect_set and collect_list aggregate functions
> ------------------------------------------------
>
> Key: SPARK-9301
> URL: https://issues.apache.org/jira/browse/SPARK-9301
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Reporter: Yin Huai
> Priority: Critical
>
> A short introduction on how to build aggregate functions based on our new
> interface can be found at
> https://issues.apache.org/jira/browse/SPARK-4366?focusedCommentId=14639921=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14639921.
[jira] [Updated] (SPARK-9300) histogram_numeric aggregate function
[ https://issues.apache.org/jira/browse/SPARK-9300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai updated SPARK-9300:
----------------------------

    Target Version/s:   (was: 1.6.0)

> histogram_numeric aggregate function
> ------------------------------------
>
> Key: SPARK-9300
> URL: https://issues.apache.org/jira/browse/SPARK-9300
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Reporter: Yin Huai
>
> A short introduction on how to build aggregate functions based on our new
> interface can be found at
> https://issues.apache.org/jira/browse/SPARK-4366?focusedCommentId=14639921=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14639921.
[jira] [Updated] (SPARK-9299) percentile and percentile_approx aggregate functions
[ https://issues.apache.org/jira/browse/SPARK-9299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai updated SPARK-9299:
----------------------------

    Target Version/s:   (was: 1.6.0)

> percentile and percentile_approx aggregate functions
> ----------------------------------------------------
>
> Key: SPARK-9299
> URL: https://issues.apache.org/jira/browse/SPARK-9299
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Reporter: Yin Huai
>
> A short introduction on how to build aggregate functions based on our new
> interface can be found at
> https://issues.apache.org/jira/browse/SPARK-4366?focusedCommentId=14639921=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14639921.
[jira] [Commented] (SPARK-11599) NPE when resolve a non-existent function
[ https://issues.apache.org/jira/browse/SPARK-11599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997501#comment-14997501 ]

Apache Spark commented on SPARK-11599:
--------------------------------------

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/9576

> NPE when resolve a non-existent function
> ----------------------------------------
>
> Key: SPARK-11599
> URL: https://issues.apache.org/jira/browse/SPARK-11599
> Project: Spark
> Issue Type: Bug
> Affects Versions: 1.6.0
> Reporter: Davies Liu
>
> {code}
> java.lang.NullPointerException
>     at org.apache.hadoop.hive.ql.exec.Registry.getFunctionInfo(Registry.java:254)
>     at org.apache.hadoop.hive.ql.exec.FunctionRegistry.getFunctionInfo(FunctionRegistry.java:466)
>     at org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2.apply(hiveUDFs.scala:59)
>     at org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2.apply(hiveUDFs.scala:55)
>     at scala.util.Try.getOrElse(Try.scala:77)
>     at org.apache.spark.sql.hive.HiveFunctionRegistry.lookupFunction(hiveUDFs.scala:55)
>     at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$11$$anonfun$applyOrElse$5$$anonfun$applyOrElse$21.apply(Analyzer.scala:527)
>     at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$11$$anonfun$applyOrElse$5$$anonfun$applyOrElse$21.apply(Analyzer.scala:527)
>     at org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
>     at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$11$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:526)
>     at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$11$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:523)
>     at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:228)
>     at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:228)
>     at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
>     at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:227)
>     at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:233)
>     at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:233)
>     at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:250)
>     at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>     at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>     at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>     at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>     at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>     at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>     at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>     at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>     at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>     at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>     at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>     at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>     at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:280)
>     at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:233)
>     at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionDown$1(QueryPlan.scala:75)
>     at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:85)
>     at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:89)
>     at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>     at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>     at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>     at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>     at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>     at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>     at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:89)
>     at
[jira] [Updated] (SPARK-11587) SparkR can not use summary.glm from base R
[ https://issues.apache.org/jira/browse/SPARK-11587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng updated SPARK-11587:
----------------------------------

    Priority: Critical  (was: Major)

> SparkR can not use summary.glm from base R
> ------------------------------------------
>
> Key: SPARK-11587
> URL: https://issues.apache.org/jira/browse/SPARK-11587
> Project: Spark
> Issue Type: Bug
> Components: ML, R, SparkR
> Reporter: Yanbo Liang
> Priority: Critical
>
> When we use the summary method of base R (not the SparkR method) in the SparkR console:
> {code}
> model <- glm(Sepal.Width ~ Sepal.Length + Species, data = iris)
> summary(model)
> {code}
> It returns
> {code}
> Error in (function (classes, fdef, mtable) :
>   unable to find an inherited method for function ‘summary’ for signature ‘"glm"’
> {code}
[jira] [Commented] (SPARK-11326) Support for authentication and encryption in standalone mode
[ https://issues.apache.org/jira/browse/SPARK-11326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997448#comment-14997448 ]

Patrick Wendell commented on SPARK-11326:
-----------------------------------------

There are a few related conversations here:

1. The feature set of the standalone scheduler and its goals. The main goal of that scheduler is to make it easy for people to download and run Spark with minimal extra dependencies. The main difference between the standalone mode and other schedulers is that we aren't providing support for scheduling other frameworks than Spark (and likely never will). Other than that, features are added on a case-by-case basis depending on whether there is sufficient commitment from the maintainers to support the feature long term.

2. Security in non-YARN modes. I would actually like to see better support for security in other modes of Spark, the main reason being supporting the large number of users not inside of Hadoop deployments. BTW, I think the existing security architecture of Spark makes this possible, because the concern of distributing a shared secret is largely decoupled from the specific security mechanism. But we haven't really exposed public hooks for injecting secrets. There is also the question of secure job submission, which is addressed in this JIRA. This needs some thought and probably makes sense to discuss in the Spark 1.7 timeframe.

Overall I think some broader questions need to be answered, and it's something perhaps we can discuss once 1.6 is out the door as we think about 1.7.

> Support for authentication and encryption in standalone mode
> ------------------------------------------------------------
>
> Key: SPARK-11326
> URL: https://issues.apache.org/jira/browse/SPARK-11326
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Reporter: Jacek Lewandowski
>
> h3.The idea
> Currently, in standalone mode, all components, for all network connections
> need to use the same secure token if they want to have any security ensured.
> This ticket is intended to split the communication in standalone mode to make
> it more like in Yarn mode - application internal communication and scheduler
> communication.
> Such refactoring will allow for the scheduler (master, workers) to use a
> distinct secret, which will remain unknown for the users. Similarly, it will
> allow for better security in applications, because each application will be
> able to use a distinct secret as well.
> By providing SASL authentication/encryption for connections between a client
> (Client or AppClient) and Spark Master, it becomes possible to introduce
> pluggable authentication for standalone deployment mode.
> h3.Improvements introduced by this patch
> This patch introduces the following changes:
> * Spark driver or submission client do not have to use the same secret as
> workers use to communicate with Master
> * Master is able to authenticate individual clients with the following rules:
> ** When connecting to the master, the client needs to specify
> {{spark.authenticate.secret}} which is an authentication token for the user
> specified by {{spark.authenticate.user}} ({{sparkSaslUser}} by default)
> ** Master configuration may include additional
> {{spark.authenticate.secrets.}} entries for specifying
> authentication token for particular users or
> {{spark.authenticate.authenticatorClass}} which specifies an implementation of
> external credentials provider (which is able to retrieve the authentication
> token for a given user).
> ** Workers authenticate with Master as default user {{sparkSaslUser}}.
> * The authorization rules are as follows:
> ** A regular user is able to manage only his own application (the application
> which he submitted)
> ** A regular user is not able to register or manage workers
> ** Spark default user {{sparkSaslUser}} can manage all the applications
> h3.User facing changes when running application
> h4.General principles:
> - conf: {{spark.authenticate.secret}} is *never sent* over the wire
> - env: {{SPARK_AUTH_SECRET}} is *never sent* over the wire
> - In all situations env variable will overwrite conf variable if present.
> - In all situations when a user has to pass a secret, it is better (safer) to
> do this through env variable
> - In work modes with multiple secrets we assume encrypted communication
> between client and master, between driver and master, between master and
> workers
>
> h4.Work modes and descriptions
> h5.Client mode, single secret
> h6.Configuration
> - env: {{SPARK_AUTH_SECRET=secret}} or conf:
> {{spark.authenticate.secret=secret}}
> h6.Description
> - The driver is running locally
> - The driver will neither send env: {{SPARK_AUTH_SECRET}} nor conf:
> {{spark.authenticate.secret}}
> - The driver will use either env: {{SPARK_AUTH_SECRET}} or conf:
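The precedence and authorization rules quoted above can be illustrated with a minimal sketch. This is not code from the proposed patch: the function names `resolve_secret` and `can_manage` are hypothetical, and only the key names `SPARK_AUTH_SECRET`, `spark.authenticate.secret` and the default user `sparkSaslUser` come from the ticket text.

```python
import os

ENV_KEY = "SPARK_AUTH_SECRET"           # env variable named in the ticket
CONF_KEY = "spark.authenticate.secret"  # conf key named in the ticket
DEFAULT_USER = "sparkSaslUser"          # default user named in the ticket


def resolve_secret(conf, env=None):
    """Pick the secret to use; the env variable overrides conf if present.

    Hypothetical helper: `conf` is a plain dict standing in for the
    application configuration.
    """
    env = os.environ if env is None else env
    if ENV_KEY in env:
        return env[ENV_KEY]
    return conf.get(CONF_KEY)


def can_manage(user, app_owner):
    """Authorization rule from the ticket: the default user may manage every
    application, a regular user only the application they submitted."""
    return user == DEFAULT_USER or user == app_owner
```

Keeping the lookup in one place makes the "env overwrites conf" rule easy to audit; how the secret is then used for SASL is outside the scope of this sketch.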
[jira] [Resolved] (SPARK-11508) Add Python API for repartition and sortWithinPartitions
[ https://issues.apache.org/jira/browse/SPARK-11508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai resolved SPARK-11508.
------------------------------

       Resolution: Fixed
         Assignee: Nong Li
    Fix Version/s: 1.6.0

It has been resolved by https://github.com/apache/spark/pull/9504.

> Add Python API for repartition and sortWithinPartitions
> -------------------------------------------------------
>
> Key: SPARK-11508
> URL: https://issues.apache.org/jira/browse/SPARK-11508
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Reporter: Reynold Xin
> Assignee: Nong Li
> Fix For: 1.6.0
>
> We added a few new methods in 1.6 that are still missing in Python:
> {code}
> def repartition(numPartitions: Int, partitionExprs: Column*): DataFrame
> def repartition(partitionExprs: Column*): DataFrame
> def sortWithinPartitions(sortExprs: Column*): DataFrame
> def sortWithinPartitions(sortCol: String, sortCols: String*): DataFrame
> {code}
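For readers unfamiliar with the two operations named in the signatures above, here is a plain-Python sketch of their semantics, with no Spark dependency. It is only an illustration under assumptions: rows are tuples, partitions are lists, and the helper names `repartition` and `sort_within_partitions` are hypothetical stand-ins rather than the PySpark API added by the pull request.

```python
def repartition(rows, num_partitions, key_fn):
    """Distribute rows so that rows with an equal key share a partition."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[hash(key_fn(row)) % num_partitions].append(row)
    return partitions


def sort_within_partitions(partitions, sort_key):
    """Sort each partition independently; no global order is established."""
    return [sorted(part, key=sort_key) for part in partitions]


rows = [(2, "b"), (1, "z"), (2, "a"), (1, "y"), (3, "c")]
parts = repartition(rows, 2, key_fn=lambda r: r[0])
sorted_parts = sort_within_partitions(parts, sort_key=lambda r: r[1])
```

The point of the second function is that the sort never moves a row across partitions, which is what distinguishes it from a global sort.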
[jira] [Updated] (SPARK-11574) Spark should support StatsD sink out of box
[ https://issues.apache.org/jira/browse/SPARK-11574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiaofeng Lin updated SPARK-11574:
---------------------------------

    Affects Version/s: 1.6.0

> Spark should support StatsD sink out of box
> -------------------------------------------
>
> Key: SPARK-11574
> URL: https://issues.apache.org/jira/browse/SPARK-11574
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 1.5.1, 1.6.0
> Reporter: Xiaofeng Lin
>
> In order to run spark in production, monitoring is essential. StatsD is such
> a common metric reporting mechanism that it should be supported out of the
> box. This will enable publishing metrics to monitoring services like
> datadog, etc.
[jira] [Resolved] (SPARK-11552) Replace example code in ml-decision-tree.md using include_example
[ https://issues.apache.org/jira/browse/SPARK-11552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng resolved SPARK-11552.
-----------------------------------

       Resolution: Fixed
    Fix Version/s: 1.6.0

Issue resolved by pull request 9539
[https://github.com/apache/spark/pull/9539]

> Replace example code in ml-decision-tree.md using include_example
> -----------------------------------------------------------------
>
> Key: SPARK-11552
> URL: https://issues.apache.org/jira/browse/SPARK-11552
> Project: Spark
> Issue Type: Sub-task
> Components: Documentation
> Reporter: Xusen Yin
> Labels: starter
> Fix For: 1.6.0
[jira] [Updated] (SPARK-11552) Replace example code in ml-decision-tree.md using include_example
[ https://issues.apache.org/jira/browse/SPARK-11552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng updated SPARK-11552:
----------------------------------

    Target Version/s: 1.6.0
            Priority: Minor  (was: Major)
         Component/s: ML

> Replace example code in ml-decision-tree.md using include_example
> -----------------------------------------------------------------
>
> Key: SPARK-11552
> URL: https://issues.apache.org/jira/browse/SPARK-11552
> Project: Spark
> Issue Type: Sub-task
> Components: Documentation, ML
> Reporter: Xusen Yin
> Assignee: sachin aggarwal
> Priority: Minor
> Labels: starter
> Fix For: 1.6.0
[jira] [Resolved] (SPARK-11462) Add JavaStreamingListener
[ https://issues.apache.org/jira/browse/SPARK-11462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tathagata Das resolved SPARK-11462.
-----------------------------------

       Resolution: Fixed
         Assignee: Shixiong Zhu
    Fix Version/s: 1.6.0

> Add JavaStreamingListener
> -------------------------
>
> Key: SPARK-11462
> URL: https://issues.apache.org/jira/browse/SPARK-11462
> Project: Spark
> Issue Type: Improvement
> Components: Streaming
> Reporter: Shixiong Zhu
> Assignee: Shixiong Zhu
> Fix For: 1.6.0
>
> Add Java friendly API for StreamingListener
[jira] [Created] (SPARK-11620) parquet.hadoop.ParquetOutputCommitter.commitJob() throws parquet.io.ParquetEncodingException
swetha k created SPARK-11620:
-----------------------------

    Summary: parquet.hadoop.ParquetOutputCommitter.commitJob() throws parquet.io.ParquetEncodingException
    Key: SPARK-11620
    URL: https://issues.apache.org/jira/browse/SPARK-11620
    Project: Spark
    Issue Type: Bug
    Reporter: swetha k
[jira] [Commented] (SPARK-11587) SparkR can not use summary.glm from base R
[ https://issues.apache.org/jira/browse/SPARK-11587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997842#comment-14997842 ]

Apache Spark commented on SPARK-11587:
--------------------------------------

User 'shivaram' has created a pull request for this issue:
https://github.com/apache/spark/pull/9582

> SparkR can not use summary.glm from base R
> ------------------------------------------
>
> Key: SPARK-11587
> URL: https://issues.apache.org/jira/browse/SPARK-11587
> Project: Spark
> Issue Type: Bug
> Components: ML, R, SparkR
> Reporter: Yanbo Liang
> Priority: Critical
>
> When we use the summary method of base R (not the SparkR method) in the SparkR console:
> {code}
> model <- glm(Sepal.Width ~ Sepal.Length + Species, data = iris)
> summary(model)
> {code}
> It returns
> {code}
> Error in (function (classes, fdef, mtable) :
>   unable to find an inherited method for function ‘summary’ for signature ‘"glm"’
> {code}
[jira] [Assigned] (SPARK-11587) SparkR can not use summary.glm from base R
[ https://issues.apache.org/jira/browse/SPARK-11587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-11587:
------------------------------------

    Assignee:     (was: Apache Spark)

> SparkR can not use summary.glm from base R
> ------------------------------------------
>
> Key: SPARK-11587
> URL: https://issues.apache.org/jira/browse/SPARK-11587
> Project: Spark
> Issue Type: Bug
> Components: ML, R, SparkR
> Reporter: Yanbo Liang
> Priority: Critical
>
> When we use the summary method of base R (not the SparkR method) in the SparkR console:
> {code}
> model <- glm(Sepal.Width ~ Sepal.Length + Species, data = iris)
> summary(model)
> {code}
> It returns
> {code}
> Error in (function (classes, fdef, mtable) :
>   unable to find an inherited method for function ‘summary’ for signature ‘"glm"’
> {code}
[jira] [Assigned] (SPARK-11587) SparkR can not use summary.glm from base R
[ https://issues.apache.org/jira/browse/SPARK-11587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-11587:
------------------------------------

    Assignee: Apache Spark

> SparkR can not use summary.glm from base R
> ------------------------------------------
>
> Key: SPARK-11587
> URL: https://issues.apache.org/jira/browse/SPARK-11587
> Project: Spark
> Issue Type: Bug
> Components: ML, R, SparkR
> Reporter: Yanbo Liang
> Assignee: Apache Spark
> Priority: Critical
>
> When we use the summary method of base R (not the SparkR method) in the SparkR console:
> {code}
> model <- glm(Sepal.Width ~ Sepal.Length + Species, data = iris)
> summary(model)
> {code}
> It returns
> {code}
> Error in (function (classes, fdef, mtable) :
>   unable to find an inherited method for function ‘summary’ for signature ‘"glm"’
> {code}
[jira] [Commented] (SPARK-11611) Python API for bisecting k-means
[ https://issues.apache.org/jira/browse/SPARK-11611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997875#comment-14997875 ]

Yu Ishikawa commented on SPARK-11611:
-------------------------------------

[~mengxr] can we change the target version from 1.7.0 to 1.6.0?

> Python API for bisecting k-means
> --------------------------------
>
> Key: SPARK-11611
> URL: https://issues.apache.org/jira/browse/SPARK-11611
> Project: Spark
> Issue Type: New Feature
> Components: MLlib, PySpark
> Reporter: Xiangrui Meng
>
> Implement Python API for bisecting k-means.
[jira] [Issue Comment Deleted] (SPARK-8506) SparkR does not provide an easy way to depend on Spark Packages when performing init from inside of R
[ https://issues.apache.org/jira/browse/SPARK-8506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Girish Reddy updated SPARK-8506:
--------------------------------

    Comment: was deleted

(was: Hi [~holdenk] - I am getting an error when specifying multiple packages with a comma separating them. Is there an example showing how multiple packages can be specified in the argument?)

> SparkR does not provide an easy way to depend on Spark Packages when
> performing init from inside of R
> ---------------------------------------------------------------------
>
> Key: SPARK-8506
> URL: https://issues.apache.org/jira/browse/SPARK-8506
> Project: Spark
> Issue Type: Bug
> Components: SparkR
> Affects Versions: 1.4.0
> Reporter: holdenk
> Assignee: holdenk
> Priority: Minor
> Fix For: 1.4.1, 1.5.0
>
> While packages can be specified when using the sparkR or sparkSubmit scripts,
> the programming guide tells people to create their spark context using the R
> shell + init. The init does have a parameter for jars but no parameter for
> packages. Setting the SPARKR_SUBMIT_ARGS overwrites some necessary
> information. I think a good solution would just be adding another field to
> the init function to allow people to specify packages in the same way as jars.
[jira] [Comment Edited] (SPARK-11620) parquet.hadoop.ParquetOutputCommitter.commitJob() throws parquet.io.ParquetEncodingException
[ https://issues.apache.org/jira/browse/SPARK-11620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998015#comment-14998015 ]

swetha k edited comment on SPARK-11620 at 11/10/15 6:07 AM:
------------------------------------------------------------

I see the following Warning message when I use parquet-avro in my Spark Batch. Following is the dependency that I use.

com.twitter parquet-avro 1.6.0

Nov 8, 2015 11:35:39 PM WARNING: parquet.hadoop.ParquetOutputCommitter: could not write summary file for active_sessions_current
parquet.io.ParquetEncodingException: maprfs:/user/testId/active_sessions_current/part-r-00142.parquet invalid: all the files must be contained in the root active_sessions_current
    at parquet.hadoop.ParquetFileWriter.mergeFooters(ParquetFileWriter.java:422)
    at parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:398)
    at parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:51)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1056)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:998)

was (Author: swethakasireddy):
I see the following Warning message when I use parquet-avro. Following is the dependency that I use.

com.twitter parquet-avro 1.6.0

Nov 8, 2015 11:35:39 PM WARNING: parquet.hadoop.ParquetOutputCommitter: could not write summary file for active_sessions_current
parquet.io.ParquetEncodingException: maprfs:/user/testId/active_sessions_current/part-r-00142.parquet invalid: all the files must be contained in the root active_sessions_current
    at parquet.hadoop.ParquetFileWriter.mergeFooters(ParquetFileWriter.java:422)
    at parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:398)
    at parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:51)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1056)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:998)

> parquet.hadoop.ParquetOutputCommitter.commitJob() throws
> parquet.io.ParquetEncodingException
> ---------------------------------------------------------
>
> Key: SPARK-11620
> URL: https://issues.apache.org/jira/browse/SPARK-11620
> Project: Spark
> Issue Type: Bug
> Reporter: swetha k
[jira] [Commented] (SPARK-11620) parquet.hadoop.ParquetOutputCommitter.commitJob() throws parquet.io.ParquetEncodingException
[ https://issues.apache.org/jira/browse/SPARK-11620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998015#comment-14998015 ]

swetha k commented on SPARK-11620:
----------------------------------

I see the following Warning message when I use parquet-avro. Following is the dependency that I use.

com.twitter parquet-avro 1.6.0

Nov 8, 2015 11:35:39 PM WARNING: parquet.hadoop.ParquetOutputCommitter: could not write summary file for active_sessions_current
parquet.io.ParquetEncodingException: maprfs:/user/testId/active_sessions_current/part-r-00142.parquet invalid: all the files must be contained in the root active_sessions_current
    at parquet.hadoop.ParquetFileWriter.mergeFooters(ParquetFileWriter.java:422)
    at parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:398)
    at parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:51)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1056)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:998)

> parquet.hadoop.ParquetOutputCommitter.commitJob() throws
> parquet.io.ParquetEncodingException
> ---------------------------------------------------------
>
> Key: SPARK-11620
> URL: https://issues.apache.org/jira/browse/SPARK-11620
> Project: Spark
> Issue Type: Bug
> Reporter: swetha k
[jira] [Commented] (SPARK-11618) Refactoring of basic ML import/export
[ https://issues.apache.org/jira/browse/SPARK-11618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998016#comment-14998016 ]

Apache Spark commented on SPARK-11618:
--------------------------------------

User 'jkbradley' has created a pull request for this issue:
https://github.com/apache/spark/pull/9587

> Refactoring of basic ML import/export
> -------------------------------------
>
> Key: SPARK-11618
> URL: https://issues.apache.org/jira/browse/SPARK-11618
> Project: Spark
> Issue Type: Sub-task
> Components: ML
> Reporter: Joseph K. Bradley
> Assignee: Joseph K. Bradley
>
> This is for a few updates to the original PR for basic ML import/export in
> [SPARK-11217].
> * The original PR diverges from the design doc in that it does not include
> the Spark version or a model format version. We should include the Spark
> version in the metadata. If we do that, then we don't really need a model
> format version.
> * Proposal: DefaultParamsWriter includes two separable pieces of logic in
> save(): (a) handling overwriting and (b) saving Params. I want to separate
> these by putting (a) in a save() method in Writer which calls an abstract
> saveImpl, and (b) in the saveImpl implementation in DefaultParamsWriter.
> This is described below:
> {code}
> abstract class Writer {
>   def save(path: String) = {
>     // handle overwrite
>     saveImpl(path)
>   }
>   def saveImpl(path: String) // abstract
> }
> class DefaultParamsWriter extends Writer {
>   def saveImpl(path: String) = {
>     // save Params
>   }
> }
> {code}
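The split proposed in the ticket is the template-method pattern. A runnable Python sketch of the same shape follows, offered only as an illustration: the overwrite policy (fail unless `overwrite()` was called, otherwise delete the target first), the `params.json` file name and the dict of params are assumptions made for this example, not details taken from the pull request.

```python
import json
import os
import shutil
from abc import ABC, abstractmethod


class Writer(ABC):
    """Owns the overwrite handling; subclasses only write the content."""

    def __init__(self):
        self._overwrite = False

    def overwrite(self):
        # Assumed opt-in switch; returns self so calls can be chained.
        self._overwrite = True
        return self

    def save(self, path):
        # (a) handle overwrite
        if os.path.exists(path):
            if not self._overwrite:
                raise FileExistsError(f"{path} already exists")
            if os.path.isdir(path):
                shutil.rmtree(path)
            else:
                os.remove(path)
        # (b) delegate the actual write to the subclass
        self.save_impl(path)

    @abstractmethod
    def save_impl(self, path):
        """Write the content to a path that is known not to exist."""


class DefaultParamsWriter(Writer):
    def __init__(self, params):
        super().__init__()
        self.params = params

    def save_impl(self, path):
        # Save the params as JSON inside a fresh directory.
        os.makedirs(path)
        with open(os.path.join(path, "params.json"), "w") as f:
            json.dump(self.params, f)
```

With this shape every writer gets identical overwrite behaviour for free, and `save_impl` can assume the target path is clean.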
[jira] [Resolved] (SPARK-11141) Batching of ReceivedBlockTrackerLogEvents for efficient WAL writes
[ https://issues.apache.org/jira/browse/SPARK-11141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-11141. --- Resolution: Fixed Assignee: Burak Yavuz Fix Version/s: 1.6.0 > Batching of ReceivedBlockTrackerLogEvents for efficient WAL writes > -- > > Key: SPARK-11141 > URL: https://issues.apache.org/jira/browse/SPARK-11141 > Project: Spark > Issue Type: Improvement > Components: Streaming >Reporter: Burak Yavuz >Assignee: Burak Yavuz > Fix For: 1.6.0 > > > When using S3 as a directory for WALs, the writes take too long. The driver > gets very easily bottlenecked when multiple receivers send AddBlock events to > the ReceiverTracker. This PR adds batching of events in the > ReceivedBlockTracker so that receivers don't get blocked by the driver for > too long. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11580) Just do final aggregation when there is no Exchange
[ https://issues.apache.org/jira/browse/SPARK-11580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust updated SPARK-11580:
-------------------------------------
    Target Version/s:   (was: 1.6.0)

> Just do final aggregation when there is no Exchange
> ---------------------------------------------------
>
>                 Key: SPARK-11580
>                 URL: https://issues.apache.org/jira/browse/SPARK-11580
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.5.1
>            Reporter: Yadong Qi
>
> I run the following SQL:
> {code}
> cache table src as select * from src distribute by key;
> select key, count(value) from src group by key;
> {code}
> and the physical plan is:
> {code}
> TungstenAggregate(key=[key#0], functions=[(count(value#1),mode=Final,isDistinct=false)], output=[key#0,_c1#28L])
>  TungstenAggregate(key=[key#0], functions=[(count(value#1),mode=Partial,isDistinct=false)], output=[key#0,currentCount#41L])
>   InMemoryColumnarTableScan [key#0,value#1], (InMemoryRelation [key#0,value#1], true, 1, StorageLevel(true, true, false, true, 1), (TungstenExchange hashpartitioning(key#0)), Some(src))
> {code}
> I think that if there is no *Exchange*, doing only the final aggregation is better, like:
> {code}
> TungstenAggregate(key=[key#0], functions=[(count(value#1),mode=Final,isDistinct=false)], output=[key#0,_c1#28L])
>  InMemoryColumnarTableScan [key#0,value#1], (InMemoryRelation [key#0,value#1], true, 1, StorageLevel(true, true, false, true, 1), (TungstenExchange hashpartitioning(key#0)), Some(src))
> {code}
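Editorial note on SPARK-11580: the proposal rests on the observation that a partial aggregation followed by a merge adds nothing once every key already lives in a single partition. The following is a minimal plain-Python sketch of that observation, not Spark's planner logic; `count_two_phase`, `count_single_phase`, and the sample `partitions` are hypothetical names invented for illustration.

```python
from collections import Counter


def count_two_phase(partitions):
    """Partial count per partition, then a final merge across partitions."""
    partials = [
        Counter(key for key, value in part if value is not None)
        for part in partitions
    ]
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # final step: sum the partial counts per key
    return dict(merged)


def count_single_phase(partitions):
    """One aggregation per partition; valid only if no key spans partitions."""
    result = {}
    for part in partitions:
        counts = Counter(key for key, value in part if value is not None)
        overlap = counts.keys() & result.keys()
        if overlap:
            raise ValueError("keys span partitions: %r" % sorted(overlap))
        result.update(counts)
    return result


# Hypothetical rows, already co-located by key (as after DISTRIBUTE BY key).
partitions = [
    [("a", 1), ("a", 2), ("c", 3)],
    [("b", 4), ("b", None)],
]
two_phase = count_two_phase(partitions)
single_phase = count_single_phase(partitions)
```

Both functions return the same counts here because the input is partitioned by key. If a key appeared in two partitions, the single-phase version would refuse to run, which is the case where an Exchange and a merge step are still needed.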
[jira] [Commented] (SPARK-8506) SparkR does not provide an easy way to depend on Spark Packages when performing init from inside of R
[ https://issues.apache.org/jira/browse/SPARK-8506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998000#comment-14998000 ] Girish Reddy commented on SPARK-8506: - Hi [~holdenk] - I am getting an error when specifying multiple packages with a comma separating them. Is there an example showing how multiple packages can be specified in the argument? > SparkR does not provide an easy way to depend on Spark Packages when > performing init from inside of R > - > > Key: SPARK-8506 > URL: https://issues.apache.org/jira/browse/SPARK-8506 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.4.0 >Reporter: holdenk >Assignee: holdenk >Priority: Minor > Fix For: 1.4.1, 1.5.0 > > > While packages can be specified when using the sparkR or sparkSubmit scripts, > the programming guide tells people to create their spark context using the R > shell + init. The init does have a parameter for jars but no parameter for > packages. Setting the SPARKR_SUBMIT_ARGS overwrites some necessary > information. I think a good solution would just be adding another field to > the init function to allow people to specify packages in the same way as jars. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7675) PySpark spark.ml Params type conversions
[ https://issues.apache.org/jira/browse/SPARK-7675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7675: --- Assignee: Apache Spark > PySpark spark.ml Params type conversions > > > Key: SPARK-7675 > URL: https://issues.apache.org/jira/browse/SPARK-7675 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Joseph K. Bradley >Assignee: Apache Spark >Priority: Minor > > Currently, PySpark wrappers for spark.ml Scala classes are brittle when > accepting Param types. E.g., Normalizer's "p" param cannot be set to "2" (an > integer); it must be set to "2.0" (a float). Fixing this is not trivial > since there does not appear to be a natural place to insert the conversion > before Python wrappers call Java's Params setter method. > A possible fix will be to include a method "_checkType" to PySpark's Param > class which checks the type, prints an error if needed, and converts types > when relevant (e.g., int to float, or scipy matrix to array). The Java > wrapper method which copies params to Scala can call this method when > available. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7675) PySpark spark.ml Params type conversions
[ https://issues.apache.org/jira/browse/SPARK-7675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997814#comment-14997814 ] Apache Spark commented on SPARK-7675: - User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/9581 > PySpark spark.ml Params type conversions > > > Key: SPARK-7675 > URL: https://issues.apache.org/jira/browse/SPARK-7675 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Joseph K. Bradley >Priority: Minor > > Currently, PySpark wrappers for spark.ml Scala classes are brittle when > accepting Param types. E.g., Normalizer's "p" param cannot be set to "2" (an > integer); it must be set to "2.0" (a float). Fixing this is not trivial > since there does not appear to be a natural place to insert the conversion > before Python wrappers call Java's Params setter method. > A possible fix will be to include a method "_checkType" to PySpark's Param > class which checks the type, prints an error if needed, and converts types > when relevant (e.g., int to float, or scipy matrix to array). The Java > wrapper method which copies params to Scala can call this method when > available. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7675) PySpark spark.ml Params type conversions
[ https://issues.apache.org/jira/browse/SPARK-7675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7675: --- Assignee: (was: Apache Spark) > PySpark spark.ml Params type conversions > > > Key: SPARK-7675 > URL: https://issues.apache.org/jira/browse/SPARK-7675 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Joseph K. Bradley >Priority: Minor > > Currently, PySpark wrappers for spark.ml Scala classes are brittle when > accepting Param types. E.g., Normalizer's "p" param cannot be set to "2" (an > integer); it must be set to "2.0" (a float). Fixing this is not trivial > since there does not appear to be a natural place to insert the conversion > before Python wrappers call Java's Params setter method. > A possible fix will be to include a method "_checkType" to PySpark's Param > class which checks the type, prints an error if needed, and converts types > when relevant (e.g., int to float, or scipy matrix to array). The Java > wrapper method which copies params to Scala can call this method when > available. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
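Editorial note on SPARK-7675: the proposed "_checkType" method could look roughly like the sketch below. This is an assumption about its shape, not the code in the linked pull request; the function name `check_type` and its signature are hypothetical, it covers only the int-to-float case named in the issue, and it raises an error where the issue speaks of printing one.

```python
def check_type(value, expected_type):
    """Return `value` as `expected_type`, applying only the int-to-float widening."""
    # bool is a subclass of int in Python, so reject it for non-bool params.
    if isinstance(value, bool) and expected_type is not bool:
        raise TypeError("expected %s, got bool" % expected_type.__name__)
    if isinstance(value, expected_type):
        return value
    # Widening conversion named in the issue: int -> float.
    if expected_type is float and isinstance(value, int):
        return float(value)
    raise TypeError(
        "expected %s, got %s" % (expected_type.__name__, type(value).__name__)
    )


# Normalizer's "p" param set to the integer 2 becomes the float 2.0.
p = check_type(2, float)
```

A wrapper that copies params to the JVM could call such a function just before the setter, so that the conversion happens in one place rather than in every Python wrapper class.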
[jira] [Created] (SPARK-11618) Refactoring of basic ML import/export
Joseph K. Bradley created SPARK-11618:
-----------------------------------------

             Summary: Refactoring of basic ML import/export
                 Key: SPARK-11618
                 URL: https://issues.apache.org/jira/browse/SPARK-11618
             Project: Spark
          Issue Type: Sub-task
          Components: ML
            Reporter: Joseph K. Bradley
            Assignee: Joseph K. Bradley


This is for a few updates to the original PR for basic ML import/export in [SPARK-11217].
* The original PR diverges from the design doc in that it does not include the Spark version or a model format version. We should include the Spark version in the metadata. If we do that, then we don't really need a model format version.
* Proposal: DefaultParamsWriter includes two separable pieces of logic in save(): (a) handling overwriting and (b) saving Params. I want to separate these by putting (a) in a save() method in Writer which calls an abstract saveImpl, and (b) in the saveImpl implementation in DefaultParamsWriter.
This is described below:
{code}
abstract class Writer {
  def save(path: String) = {
    // handle overwrite
    saveImpl(path)
  }
  def saveImpl(path: String)  // abstract
}
class DefaultParamsWriter extends Writer {
  def saveImpl(path: String) = {
    // save Params
  }
}
{code}
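Editorial note on SPARK-11618: the two points of the proposal (overwrite handling in the base class, Spark version in the metadata) can be illustrated with a small Python analogue of the Scala outline. This is a sketch under stated assumptions, not the Spark implementation: the `overwrite()` method, the `metadata.json` file name, and the `sparkVersion` and `paramMap` keys are all hypothetical, and the sketch assumes the saved path is a directory.

```python
import json
import os
import shutil
import tempfile
from abc import ABC, abstractmethod


class Writer(ABC):
    """Owns the overwrite handling; subclasses only implement save_impl."""

    def __init__(self):
        self._overwrite = False

    def overwrite(self):
        self._overwrite = True
        return self

    def save(self, path):
        # (a) handle overwriting in one place
        if os.path.exists(path):
            if not self._overwrite:
                raise IOError("path already exists: %s" % path)
            shutil.rmtree(path)  # assumes the path is a directory
        self.save_impl(path)

    @abstractmethod
    def save_impl(self, path):
        """Write the instance to `path`, which does not exist yet."""


class DefaultParamsWriter(Writer):
    def __init__(self, params, spark_version):
        super().__init__()
        self._params = params
        self._spark_version = spark_version

    def save_impl(self, path):
        # (b) save Params, with the Spark version in the metadata
        os.makedirs(path)
        metadata = {"sparkVersion": self._spark_version, "paramMap": self._params}
        with open(os.path.join(path, "metadata.json"), "w") as f:
            json.dump(metadata, f)


target = os.path.join(tempfile.mkdtemp(), "model")
writer = DefaultParamsWriter({"p": 2.0}, "1.6.0")
writer.save(target)
```

With this split, a second `writer.save(target)` fails because the path exists, while `writer.overwrite().save(target)` replaces it, and no subclass has to repeat that check.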
[jira] [Commented] (SPARK-11621) ORC filter pushdown not working properly after new unhandled filter interface.
[ https://issues.apache.org/jira/browse/SPARK-11621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998014#comment-14998014 ]

Hyukjin Kwon commented on SPARK-11621:
--------------------------------------

I would like to work on this.

> ORC filter pushdown not working properly after new unhandled filter interface.
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-11621
>                 URL: https://issues.apache.org/jira/browse/SPARK-11621
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.0
>            Reporter: Hyukjin Kwon
>
> After the new interface was added to drop pushed-down filters that are already handled at the data source level (https://github.com/apache/spark/pull/9399), Spark does not push down filters for ORC.
> This is because {{DataSourceStrategy}} classifies the ORC scan as scanning a non-partitioned HadoopFsRelation, and all the filters are treated as unhandled filters.
> Also, since ORC does not filter fully record by record but instead returns rough results, the filters for ORC should not go to unhandled filters.
[jira] [Created] (SPARK-11621) ORC filter pushdown not working properly after new unhandled filter interface.
Hyukjin Kwon created SPARK-11621:
------------------------------------

             Summary: ORC filter pushdown not working properly after new unhandled filter interface.
                 Key: SPARK-11621
                 URL: https://issues.apache.org/jira/browse/SPARK-11621
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.6.0
            Reporter: Hyukjin Kwon


After the new interface was added to drop pushed-down filters that are already handled at the data source level (https://github.com/apache/spark/pull/9399), Spark does not push down filters for ORC.

This is because {{DataSourceStrategy}} classifies the ORC scan as scanning a non-partitioned HadoopFsRelation, and all the filters are treated as unhandled filters.

Also, since ORC does not filter fully record by record but instead returns rough results, the filters for ORC should not go to unhandled filters.
[jira] [Resolved] (SPARK-11587) SparkR can not use summary.glm from base R
[ https://issues.apache.org/jira/browse/SPARK-11587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-11587. --- Resolution: Fixed Fix Version/s: 1.6.0 Resolved by https://github.com/apache/spark/pull/9582 > SparkR can not use summary.glm from base R > -- > > Key: SPARK-11587 > URL: https://issues.apache.org/jira/browse/SPARK-11587 > Project: Spark > Issue Type: Bug > Components: ML, R, SparkR >Reporter: Yanbo Liang >Assignee: Shivaram Venkataraman >Priority: Critical > Fix For: 1.6.0 > > > When we use summary method of base R(not method of SparkR) in SparkR console: > {code} > model <- glm(Sepal.Width ~ Sepal.Length + Species, data = iris) > summary(model) > {code} > It returns > {code} > Error in (function (classes, fdef, mtable) : > unable to find an inherited method for function ‘summary’ for signature > ‘"glm”’ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11587) SparkR can not use summary.glm from base R
[ https://issues.apache.org/jira/browse/SPARK-11587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman reassigned SPARK-11587: - Assignee: Shivaram Venkataraman > SparkR can not use summary.glm from base R > -- > > Key: SPARK-11587 > URL: https://issues.apache.org/jira/browse/SPARK-11587 > Project: Spark > Issue Type: Bug > Components: ML, R, SparkR >Reporter: Yanbo Liang >Assignee: Shivaram Venkataraman >Priority: Critical > Fix For: 1.6.0 > > > When we use summary method of base R(not method of SparkR) in SparkR console: > {code} > model <- glm(Sepal.Width ~ Sepal.Length + Species, data = iris) > summary(model) > {code} > It returns > {code} > Error in (function (classes, fdef, mtable) : > unable to find an inherited method for function ‘summary’ for signature > ‘"glm”’ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11587) SparkR can not use summary.glm from base R
[ https://issues.apache.org/jira/browse/SPARK-11587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997967#comment-14997967 ] Shivaram Venkataraman commented on SPARK-11587: --- [~yanboliang] Let me know if the fix works as expected. > SparkR can not use summary.glm from base R > -- > > Key: SPARK-11587 > URL: https://issues.apache.org/jira/browse/SPARK-11587 > Project: Spark > Issue Type: Bug > Components: ML, R, SparkR >Reporter: Yanbo Liang >Assignee: Shivaram Venkataraman >Priority: Critical > Fix For: 1.6.0 > > > When we use summary method of base R(not method of SparkR) in SparkR console: > {code} > model <- glm(Sepal.Width ~ Sepal.Length + Species, data = iris) > summary(model) > {code} > It returns > {code} > Error in (function (classes, fdef, mtable) : > unable to find an inherited method for function ‘summary’ for signature > ‘"glm”’ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11617) MEMORY LEAK: ByteBuf.release() was not called before it's garbage-collected
LingZhou created SPARK-11617:
--------------------------------

             Summary: MEMORY LEAK: ByteBuf.release() was not called before it's garbage-collected
                 Key: SPARK-11617
                 URL: https://issues.apache.org/jira/browse/SPARK-11617
             Project: Spark
          Issue Type: Bug
          Components: Spark Core, YARN
    Affects Versions: 1.6.0
            Reporter: LingZhou


The problem may be related to [SPARK-11235][NETWORK] Add ability to stream data using network lib.

While running in yarn-client mode, there are error messages:

15/11/09 10:23:55 ERROR util.ResourceLeakDetector: LEAK: ByteBuf.release() was not called before it's garbage-collected. Enable advanced leak reporting to find out where the leak occurred. To enable advanced leak reporting, specify the JVM option '-Dio.netty.leakDetectionLevel=advanced' or call ResourceLeakDetector.setLevel() See http://netty.io/wiki/reference-counted-objects.html for more information.

and then it will cause

cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Container killed by YARN for exceeding memory limits. 9.0 GB of 9 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.

and

WARN scheduler.TaskSetManager: Lost task 105.0 in stage 1.0 (TID 2616, gsr489): java.lang.IndexOutOfBoundsException: index: 130828, length: 16833 (expected: range(0, 524288)).
[jira] [Assigned] (SPARK-11611) Python API for bisecting k-means
[ https://issues.apache.org/jira/browse/SPARK-11611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11611: Assignee: Apache Spark > Python API for bisecting k-means > > > Key: SPARK-11611 > URL: https://issues.apache.org/jira/browse/SPARK-11611 > Project: Spark > Issue Type: New Feature > Components: MLlib, PySpark >Reporter: Xiangrui Meng >Assignee: Apache Spark > > Implement Python API for bisecting k-means. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11611) Python API for bisecting k-means
[ https://issues.apache.org/jira/browse/SPARK-11611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997872#comment-14997872 ] Apache Spark commented on SPARK-11611: -- User 'yu-iskw' has created a pull request for this issue: https://github.com/apache/spark/pull/9583 > Python API for bisecting k-means > > > Key: SPARK-11611 > URL: https://issues.apache.org/jira/browse/SPARK-11611 > Project: Spark > Issue Type: New Feature > Components: MLlib, PySpark >Reporter: Xiangrui Meng > > Implement Python API for bisecting k-means. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11619) cannot use UDTF in DataFrame.selectExpr
Wenchen Fan created SPARK-11619:
-----------------------------------

             Summary: cannot use UDTF in DataFrame.selectExpr
                 Key: SPARK-11619
                 URL: https://issues.apache.org/jira/browse/SPARK-11619
             Project: Spark
          Issue Type: Bug
          Components: SQL
            Reporter: Wenchen Fan
            Priority: Minor


Currently, if we use a UDTF like `explode` or `json_tuple` in `DataFrame.selectExpr`, it is parsed into an `UnresolvedFunction` first and then aliased with `expr.prettyString`. However, a UDTF may need a MultiAlias, so we get an error if we run:
{code}
val df = Seq((Map("1" -> 1), 1)).toDF("a", "b")
df.selectExpr("explode(a)").show()
{code}
[info]   org.apache.spark.sql.AnalysisException: Expect multiple names given for org.apache.spark.sql.catalyst.expressions.Explode,
[info] but only single name ''explode(a)' specified;
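Editorial note on SPARK-11619: the error arises because exploding a map produces two output columns (key and value), so a single alias cannot name them. The plain-Python sketch below imitates that mismatch. It is not Catalyst code; `explode` and `alias_output` are hypothetical helpers written for illustration only.

```python
def explode(collection):
    """Return (rows, n_columns) for a list or dict, in the spirit of a UDTF."""
    if isinstance(collection, dict):
        # A map explodes into two columns: key and value.
        return [(k, v) for k, v in collection.items()], 2
    # A list explodes into a single column.
    return [(item,) for item in collection], 1


def alias_output(n_columns, names):
    """Attach names to generator output; needs one name per output column."""
    if len(names) != n_columns:
        raise ValueError(
            "expected %d names but got %d: %r" % (n_columns, len(names), names)
        )
    return list(names)


rows, width = explode({"1": 1})
```

Calling `alias_output(width, ["explode(a)"])` fails for the map case in the same way the analyzer does, while `alias_output(width, ["key", "value"])` succeeds, which is the role a multi-name alias plays.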
[jira] [Updated] (SPARK-11191) [1.5] Can't create UDF's using hive thrift service
[ https://issues.apache.org/jira/browse/SPARK-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-11191: --- Target Version/s: 1.5.2, 1.5.3 (was: 1.5.2, 1.5.3, 1.6.0) > [1.5] Can't create UDF's using hive thrift service > -- > > Key: SPARK-11191 > URL: https://issues.apache.org/jira/browse/SPARK-11191 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.5.1 >Reporter: David Ross >Priority: Blocker > > Since upgrading to spark 1.5 we've been unable to create and use UDF's when > we run in thrift server mode. > Our setup: > We start the thrift-server running against yarn in client mode, (we've also > built our own spark from github branch-1.5 with the following args: {{-Pyarn > -Phive -Phive-thrifeserver}} > If i run the following after connecting via JDBC (in this case via beeline): > {{add jar 'hdfs://path/to/jar"}} > (this command succeeds with no errors) > {{CREATE TEMPORARY FUNCTION testUDF AS 'com.foo.class.UDF';}} > (this command succeeds with no errors) > {{select testUDF(col1) from table1;}} > I get the following error in the logs: > {code} > org.apache.spark.sql.AnalysisException: undefined function testUDF; line 1 > pos 8 > at > org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2$$anonfun$1.apply(hiveUDFs.scala:58) > at > org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2$$anonfun$1.apply(hiveUDFs.scala:58) > at scala.Option.getOrElse(Option.scala:120) > at > org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2.apply(hiveUDFs.scala:57) > at > org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2.apply(hiveUDFs.scala:53) > at scala.util.Try.getOrElse(Try.scala:77) > at > org.apache.spark.sql.hive.HiveFunctionRegistry.lookupFunction(hiveUDFs.scala:53) > at > 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5$$anonfun$applyOrElse$24.apply(Analyzer.scala:506) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5$$anonfun$applyOrElse$24.apply(Analyzer.scala:506) > at > org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:505) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:502) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:226) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249) > {code} > (cutting the bulk for ease of report, more than happy to send the full output) > {code} > 15/10/12 14:34:37 ERROR SparkExecuteStatementOperation: Error running hive > query: > org.apache.hive.service.cli.HiveSQLException: > org.apache.spark.sql.AnalysisException: undefined function testUDF; line 1 > pos 100 > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.runInternal(SparkExecuteStatementOperation.scala:259) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:171) > at java.security.AccessController.doPrivileged(Native Method) > 
at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:182) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > When I ran the same against 1.4 it worked. > I've
[jira] [Updated] (SPARK-6728) Improve performance of py4j for large bytearray
[ https://issues.apache.org/jira/browse/SPARK-6728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davies Liu updated SPARK-6728:
------------------------------
    Target Version/s:   (was: 1.6.0)

> Improve performance of py4j for large bytearray
> -----------------------------------------------
>
>                 Key: SPARK-6728
>                 URL: https://issues.apache.org/jira/browse/SPARK-6728
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 1.3.0
>            Reporter: Davies Liu
>            Priority: Critical
>
> PySpark relies on py4j to transfer function arguments and return values between Python and the JVM, and it is very slow to pass a large bytearray (larger than 10M).
> In MLlib, it's possible to have a Vector with more than 100M bytes, which will need a few GB of memory and may crash.
> The reason is that py4j uses a text protocol: it encodes the bytearray as base64 and does multiple string concatenations.
> A binary protocol would help a lot. I created an issue for py4j: https://github.com/bartdag/py4j/issues/159
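Editorial note on SPARK-6728: part of the cost described in the issue is the size inflation of base64 itself, which turns every 3 input bytes into 4 output characters. The sketch below measures that with the Python standard library only; it does not use py4j, and `base64_length` is a helper name introduced here for illustration.

```python
import base64


def base64_length(n_bytes):
    """Length of padded base64 text for n_bytes of binary input."""
    return 4 * ((n_bytes + 2) // 3)


payload = bytes(10 * 1024 * 1024)        # 10 MiB of zero bytes
encoded = base64.b64encode(payload)      # what a text protocol has to ship
overhead = len(encoded) / len(payload)   # roughly 4/3
```

The encoded form is about one third larger than the payload before any intermediate string copies are counted, which is consistent with the issue's point that a binary transfer path would reduce both time and memory.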
[jira] [Commented] (SPARK-10538) java.lang.NegativeArraySizeException during join
[ https://issues.apache.org/jira/browse/SPARK-10538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997051#comment-14997051 ]

Davies Liu commented on SPARK-10538:
------------------------------------

[~maver1ck] Could you reproduce this issue on master or the 1.6 branch?

> java.lang.NegativeArraySizeException during join
> ------------------------------------------------
>
>                 Key: SPARK-10538
>                 URL: https://issues.apache.org/jira/browse/SPARK-10538
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.0
>            Reporter: Maciej Bryński
>            Assignee: Davies Liu
>         Attachments: screenshot-1.png
>
>
> Hi,
> I've got a problem when joining tables in PySpark (20 of them in my example).
> I can observe that during the calculation of the first partition (on one of the consecutive joins) there is a big shuffle read size (294.7 MB / 146 records) compared with the other partitions (approx. 272.5 KB / 113 records).
> I can also observe that just before the crash the Python process grows to a few GB of RAM.
> After some time there is an exception:
> {code}
> java.lang.NegativeArraySizeException
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
>     at org.apache.spark.sql.execution.TungstenProject$$anonfun$3$$anonfun$apply$3.apply(basicOperators.scala:90)
>     at org.apache.spark.sql.execution.TungstenProject$$anonfun$3$$anonfun$apply$3.apply(basicOperators.scala:88)
>     at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>     at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>     at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.insertAll(BypassMergeSortShuffleWriter.java:119)
>     at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>     at org.apache.spark.scheduler.Task.run(Task.scala:88)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> {code}
> I'm running this on a 2-node cluster (12 cores, 64 GB RAM).
> Config:
> {code}
> spark.driver.memory 10g
> spark.executor.extraJavaOptions -XX:-UseGCOverheadLimit -XX:+UseParallelGC -Dfile.encoding=UTF8
> spark.executor.memory 60g
> spark.storage.memoryFraction 0.05
> spark.shuffle.memoryFraction 0.75
> spark.driver.maxResultSize 10g
> spark.cores.max 24
> spark.kryoserializer.buffer.max 1g
> spark.default.parallelism 200
> {code}
[jira] [Updated] (SPARK-9865) Flaky SparkR test: test_sparkSQL.R: sample on a DataFrame
[ https://issues.apache.org/jira/browse/SPARK-9865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-9865: - Assignee: Felix Cheung > Flaky SparkR test: test_sparkSQL.R: sample on a DataFrame > - > > Key: SPARK-9865 > URL: https://issues.apache.org/jira/browse/SPARK-9865 > Project: Spark > Issue Type: Bug > Components: SparkR >Reporter: Davies Liu >Assignee: Felix Cheung > Fix For: 1.6.0 > > > 1. Failure (at test_sparkSQL.R#525): sample on a DataFrame > - > count(sampled3) < 3 isn't true > Error: Test failures > Execution halted > https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1468/console -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9865) Flaky SparkR test: test_sparkSQL.R: sample on a DataFrame
[ https://issues.apache.org/jira/browse/SPARK-9865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-9865. -- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 9549 [https://github.com/apache/spark/pull/9549] > Flaky SparkR test: test_sparkSQL.R: sample on a DataFrame > - > > Key: SPARK-9865 > URL: https://issues.apache.org/jira/browse/SPARK-9865 > Project: Spark > Issue Type: Bug > Components: SparkR >Reporter: Davies Liu > Fix For: 1.6.0 > > > 1. Failure (at test_sparkSQL.R#525): sample on a DataFrame > - > count(sampled3) < 3 isn't true > Error: Test failures > Execution halted > https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1468/console -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11319) PySpark silently Accepts null values in non-nullable DataFrame fields.
[ https://issues.apache.org/jira/browse/SPARK-11319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997100#comment-14997100 ]

Daniel Jalova commented on SPARK-11319:
---------------------------------------

I would like to work on this.

> PySpark silently Accepts null values in non-nullable DataFrame fields.
> ----------------------------------------------------------------------
>
>                 Key: SPARK-11319
>                 URL: https://issues.apache.org/jira/browse/SPARK-11319
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>            Reporter: Kevin Cox
>
> Running the following code with a null value in a non-nullable column silently works. This makes the code incredibly hard to trust.
> {code}
> In [2]: from pyspark.sql.types import *
> In [3]: sqlContext.createDataFrame([(None,)], StructType([StructField("a", TimestampType(), False)])).collect()
> Out[3]: [Row(a=None)]
> {code}
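Editorial note on SPARK-11319: the check the reporter expects can be stated in a few lines of plain Python. The sketch below is not PySpark code and not the eventual fix; `find_null_violations` and the `(field_name, nullable)` schema representation are hypothetical stand-ins for StructType and StructField.

```python
def find_null_violations(rows, schema):
    """Return (row_index, field_name) pairs where a non-nullable field is None.

    `schema` is a list of (field_name, nullable) pairs, one per column.
    """
    violations = []
    for row_index, row in enumerate(rows):
        for (name, nullable), value in zip(schema, row):
            if value is None and not nullable:
                violations.append((row_index, name))
    return violations


# Same shape as the report: one row, one non-nullable column "a", value None.
violations = find_null_violations([(None,)], [("a", False)])
```

Here `violations` contains the offending row and field, so a caller could raise an error instead of silently accepting the data. Declaring the column nullable makes the same input pass.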
[jira] [Commented] (SPARK-11597) improve performance of array and map encoder
[ https://issues.apache.org/jira/browse/SPARK-11597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997127#comment-14997127 ] Apache Spark commented on SPARK-11597: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/9572 > improve performance of array and map encoder > > > Key: SPARK-11597 > URL: https://issues.apache.org/jira/browse/SPARK-11597 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11597) improve performance of array and map encoder
[ https://issues.apache.org/jira/browse/SPARK-11597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11597: Assignee: (was: Apache Spark) > improve performance of array and map encoder > > > Key: SPARK-11597 > URL: https://issues.apache.org/jira/browse/SPARK-11597 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11597) improve performance of array and map encoder
[ https://issues.apache.org/jira/browse/SPARK-11597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11597: Assignee: Apache Spark > improve performance of array and map encoder > > > Key: SPARK-11597 > URL: https://issues.apache.org/jira/browse/SPARK-11597 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11598) Add tests for ShuffledHashOuterJoin
Davies Liu created SPARK-11598: -- Summary: Add tests for ShuffledHashOuterJoin Key: SPARK-11598 URL: https://issues.apache.org/jira/browse/SPARK-11598 Project: Spark Issue Type: Test Reporter: Davies Liu Assignee: Davies Liu We only test the default algorithm (SortMergeOuterJoin) for outer join, ShuffledHashOuterJoin is not well tested. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5575) Artificial neural networks for MLlib deep learning
[ https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14996706#comment-14996706 ] Narine Kokhlikyan commented on SPARK-5575: -- Hi [~avulanov], I was trying out the current implementation of ANN and have one question about it. Usually, when I run a neural network with other tools such as R, I can additionally see information such as Error, Reached Threshold and Steps. Can I also somehow get such information from Spark ANN? Maybe it is already there and I couldn't find it. I looked through the implementations of GradientDescent and LBFGS, and it seems that optimizer.optimize doesn't return values for the error, number of iterations, etc. I might be wrong here, as I am still investigating, but I'd be happy to hear from you regarding this. Thanks, Narine > Artificial neural networks for MLlib deep learning > -- > > Key: SPARK-5575 > URL: https://issues.apache.org/jira/browse/SPARK-5575 > Project: Spark > Issue Type: Umbrella > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Alexander Ulanov > > Goal: Implement various types of artificial neural networks > Motivation: deep learning trend > Requirements: > 1) Basic abstractions such as Neuron, Layer, Error, Regularization, Forward > and Backpropagation etc. should be implemented as traits or interfaces, so > they can be easily extended or reused > 2) Implement complex abstractions, such as feed forward and recurrent networks > 3) Implement multilayer perceptron (MLP), convolutional networks (LeNet), > autoencoder (sparse and denoising), stacked autoencoder, restricted > boltzmann machines (RBM), deep belief networks (DBN) etc. > 4) Implement or reuse supporting constructs, such as classifiers, normalizers, > poolers, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11191) [1.5] Can't create UDF's using hive thrift service
[ https://issues.apache.org/jira/browse/SPARK-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997028#comment-14997028 ] Davies Liu commented on SPARK-11191: This should work in master and 1.6. > [1.5] Can't create UDF's using hive thrift service > -- > > Key: SPARK-11191 > URL: https://issues.apache.org/jira/browse/SPARK-11191 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.5.1 >Reporter: David Ross >Priority: Blocker > > Since upgrading to Spark 1.5 we've been unable to create and use UDF's when > we run in thrift server mode. > Our setup: > We start the thrift-server running against yarn in client mode (we've also > built our own Spark from github branch-1.5 with the following args: {{-Pyarn > -Phive -Phive-thriftserver}}) > If I run the following after connecting via JDBC (in this case via beeline): > {{add jar 'hdfs://path/to/jar'}} > (this command succeeds with no errors) > {{CREATE TEMPORARY FUNCTION testUDF AS 'com.foo.class.UDF';}} > (this command succeeds with no errors) > {{select testUDF(col1) from table1;}} > I get the following error in the logs: > {code} > org.apache.spark.sql.AnalysisException: undefined function testUDF; line 1 > pos 8 > at > org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2$$anonfun$1.apply(hiveUDFs.scala:58) > at > org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2$$anonfun$1.apply(hiveUDFs.scala:58) > at scala.Option.getOrElse(Option.scala:120) > at > org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2.apply(hiveUDFs.scala:57) > at > org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2.apply(hiveUDFs.scala:53) > at scala.util.Try.getOrElse(Try.scala:77) > at > org.apache.spark.sql.hive.HiveFunctionRegistry.lookupFunction(hiveUDFs.scala:53) > at >
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5$$anonfun$applyOrElse$24.apply(Analyzer.scala:506) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5$$anonfun$applyOrElse$24.apply(Analyzer.scala:506) > at > org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:505) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:502) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:226) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249) > {code} > (cutting the bulk for ease of report, more than happy to send the full output) > {code} > 15/10/12 14:34:37 ERROR SparkExecuteStatementOperation: Error running hive > query: > org.apache.hive.service.cli.HiveSQLException: > org.apache.spark.sql.AnalysisException: undefined function testUDF; line 1 > pos 100 > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.runInternal(SparkExecuteStatementOperation.scala:259) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:171) > at java.security.AccessController.doPrivileged(Native Method) > 
at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:182) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > When I ran the same against 1.4 it
[jira] [Commented] (SPARK-11373) Add metrics to the History Server and providers
[ https://issues.apache.org/jira/browse/SPARK-11373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997071#comment-14997071 ] Apache Spark commented on SPARK-11373: -- User 'steveloughran' has created a pull request for this issue: https://github.com/apache/spark/pull/9571 > Add metrics to the History Server and providers > --- > > Key: SPARK-11373 > URL: https://issues.apache.org/jira/browse/SPARK-11373 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 1.5.1 >Reporter: Steve Loughran > > The History server doesn't publish metrics about JVM load or anything from > the history provider plugins. This means that performance problems from > massive job histories aren't visible to management tools, and nor are any > provider-generated metrics such as time to load histories, failed history > loads, the number of connectivity failures talking to remote services, etc. > If the history server set up a metrics registry and offered the option to > publish its metrics, then management tools could view this data. > # the metrics registry would need to be passed down to the instantiated > {{ApplicationHistoryProvider}}, in order for it to register its metrics. > # if the codahale metrics servlet were registered under a path such as > {{/metrics}}, the values would be visible as HTML and JSON, without the need > for management tools. > # Integration tests could also retrieve the JSON-formatted data and use it as > part of the test suites. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11373) Add metrics to the History Server and providers
[ https://issues.apache.org/jira/browse/SPARK-11373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11373: Assignee: Apache Spark > Add metrics to the History Server and providers > --- > > Key: SPARK-11373 > URL: https://issues.apache.org/jira/browse/SPARK-11373 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 1.5.1 >Reporter: Steve Loughran >Assignee: Apache Spark > > The History server doesn't publish metrics about JVM load or anything from > the history provider plugins. This means that performance problems from > massive job histories aren't visible to management tools, and nor are any > provider-generated metrics such as time to load histories, failed history > loads, the number of connectivity failures talking to remote services, etc. > If the history server set up a metrics registry and offered the option to > publish its metrics, then management tools could view this data. > # the metrics registry would need to be passed down to the instantiated > {{ApplicationHistoryProvider}}, in order for it to register its metrics. > # if the codahale metrics servlet were registered under a path such as > {{/metrics}}, the values would be visible as HTML and JSON, without the need > for management tools. > # Integration tests could also retrieve the JSON-formatted data and use it as > part of the test suites. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
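The proposal quoted above (a registry owned by the server, passed down to the provider, and published as JSON) can be sketched in a few lines of Python. Every name here (`MetricsRegistry`, `HistoryProvider`, the metric names) is hypothetical and chosen for illustration; this is not the Codahale metrics API or Spark's `ApplicationHistoryProvider`.

```python
import json
from typing import Callable, Dict


class MetricsRegistry:
    """Hypothetical minimal registry mapping metric names to gauge callables."""

    def __init__(self) -> None:
        self._gauges: Dict[str, Callable[[], float]] = {}

    def register(self, name: str, gauge: Callable[[], float]) -> None:
        if name in self._gauges:
            raise ValueError(f"metric already registered: {name}")
        self._gauges[name] = gauge

    def to_json(self) -> str:
        # Evaluate every gauge at publication time, as a metrics servlet would.
        return json.dumps({name: gauge() for name, gauge in sorted(self._gauges.items())})


class HistoryProvider:
    """Hypothetical provider that registers its own metrics on construction."""

    def __init__(self, registry: MetricsRegistry) -> None:
        self.loads = 0
        self.failed_loads = 0
        registry.register("history.provider.loads", lambda: self.loads)
        registry.register("history.provider.failed_loads", lambda: self.failed_loads)

    def record_load(self, succeeded: bool) -> None:
        self.loads += 1
        if not succeeded:
            self.failed_loads += 1
```

Because the gauges are read lazily, an integration test can drive the provider and then parse the JSON output, which is the third use case listed in the issue.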
[jira] [Updated] (SPARK-10790) Dynamic Allocation does not request any executors if first stage needs less than or equal to spark.dynamicAllocation.initialExecutors
[ https://issues.apache.org/jira/browse/SPARK-10790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-10790: -- Affects Version/s: (was: 1.5.1) 1.5.0 > Dynamic Allocation does not request any executors if first stage needs less > than or equal to spark.dynamicAllocation.initialExecutors > - > > Key: SPARK-10790 > URL: https://issues.apache.org/jira/browse/SPARK-10790 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 1.5.0 >Reporter: Jonathan Kelly >Assignee: Saisai Shao > Fix For: 1.5.2, 1.6.0 > > > If you set spark.dynamicAllocation.initialExecutors > 0 (or > spark.dynamicAllocation.minExecutors, since > spark.dynamicAllocation.initialExecutors defaults to > spark.dynamicAllocation.minExecutors), and the number of tasks in the first > stage of your job is less than or equal to this min/init number of executors, > dynamic allocation won't actually request any executors and will just hang > indefinitely with the warning "Initial job has not accepted any resources; > check your cluster UI to ensure that workers are registered and have > sufficient resources". > The cause appears to be that ExecutorAllocationManager does not request any > executors while the application is still initializing, but it still sets the > initial value of numExecutorsTarget to > spark.dynamicAllocation.initialExecutors. Once the job is running and has > submitted its first task, if the first task does not need more than > spark.dynamicAllocation.initialExecutors, > ExecutorAllocationManager.updateAndSyncNumExecutorsTarget() does not think > that it needs to request any executors, so it doesn't. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10790) Dynamic Allocation does not request any executors if first stage needs less than or equal to spark.dynamicAllocation.initialExecutors
[ https://issues.apache.org/jira/browse/SPARK-10790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-10790: -- Affects Version/s: (was: 1.5.0) 1.5.1 > Dynamic Allocation does not request any executors if first stage needs less > than or equal to spark.dynamicAllocation.initialExecutors > - > > Key: SPARK-10790 > URL: https://issues.apache.org/jira/browse/SPARK-10790 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 1.5.0 >Reporter: Jonathan Kelly >Assignee: Saisai Shao > Fix For: 1.5.2, 1.6.0 > > > If you set spark.dynamicAllocation.initialExecutors > 0 (or > spark.dynamicAllocation.minExecutors, since > spark.dynamicAllocation.initialExecutors defaults to > spark.dynamicAllocation.minExecutors), and the number of tasks in the first > stage of your job is less than or equal to this min/init number of executors, > dynamic allocation won't actually request any executors and will just hang > indefinitely with the warning "Initial job has not accepted any resources; > check your cluster UI to ensure that workers are registered and have > sufficient resources". > The cause appears to be that ExecutorAllocationManager does not request any > executors while the application is still initializing, but it still sets the > initial value of numExecutorsTarget to > spark.dynamicAllocation.initialExecutors. Once the job is running and has > submitted its first task, if the first task does not need more than > spark.dynamicAllocation.initialExecutors, > ExecutorAllocationManager.updateAndSyncNumExecutorsTarget() does not think > that it needs to request any executors, so it doesn't. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
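The failure mode described in the report can be reproduced with a deliberately simplified model. The function below is a sketch written for this purpose, assuming one task per executor; `simulate_first_sync` and its variables are illustrative names, not code from `ExecutorAllocationManager`.

```python
def simulate_first_sync(initial_executors: int, first_stage_tasks: int) -> int:
    """Return how many executors end up requested from the cluster manager.

    Simplified model of the behaviour described in the report, assuming one
    task per executor.
    """
    # While the application is initializing, nothing is requested, but the
    # target is already set to the configured initial value.
    requested_from_cluster = 0
    num_executors_target = initial_executors

    # When the first stage is submitted, a request is only sent if the
    # stage needs more executors than the current target.
    needed = first_stage_tasks
    if needed > num_executors_target:
        num_executors_target = needed
        requested_from_cluster = num_executors_target

    return requested_from_cluster
```

In this model, any first stage with a task count at or below the initial value leaves the requested count at zero, which matches the hang described above; only a larger stage triggers a request.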
[jira] [Updated] (SPARK-11362) Use Spark BitSet in BroadcastNestedLoopJoin
[ https://issues.apache.org/jira/browse/SPARK-11362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-11362: Fix Version/s: (was: 1.7.0) 1.6.0 > Use Spark BitSet in BroadcastNestedLoopJoin > --- > > Key: SPARK-11362 > URL: https://issues.apache.org/jira/browse/SPARK-11362 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh > Fix For: 1.6.0 > > > We use scala.collection.mutable.BitSet in BroadcastNestedLoopJoin now. We > should use Spark's BitSet. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10565) New /api/v1/[path] APIs don't contain as much information as original /json API
[ https://issues.apache.org/jira/browse/SPARK-10565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid resolved SPARK-10565. -- Resolution: Fixed Fix Version/s: 1.6.0 > New /api/v1/[path] APIs don't contain as much information as original /json > API > > > Key: SPARK-10565 > URL: https://issues.apache.org/jira/browse/SPARK-10565 > Project: Spark > Issue Type: Improvement > Components: Input/Output, Java API >Affects Versions: 1.5.0 >Reporter: Kevin Chen >Assignee: Charles Yeh > Fix For: 1.6.0 > > Original Estimate: 1h > Remaining Estimate: 1h > > [SPARK-3454] introduced official json APIs at /api/v1/[path] for data that > originally appeared only on the web UI. However, it does not expose all the > information on the web UI or on the previous unofficial endpoint at /json. > For example, the APIs at /api/v1/[path] do not show the number of cores or > amount of memory per slave for each job. This is stored in > ApplicationInfo.desc.maxCores and ApplicationInfo.desc.memoryPerSlave, > respectively. This information would be useful to expose. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10783) Do track the pointer array in UnsafeInMemorySorter
[ https://issues.apache.org/jira/browse/SPARK-10783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-10783. Resolution: Fixed Assignee: Davies Liu Fix Version/s: 1.6.0 Fixed by https://github.com/apache/spark/pull/9241 > Do track the pointer array in UnsafeInMemorySorter > -- > > Key: SPARK-10783 > URL: https://issues.apache.org/jira/browse/SPARK-10783 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1, 1.6.0 >Reporter: Andrew Or >Assignee: Davies Liu >Priority: Blocker > Fix For: 1.6.0 > > > SPARK-10474 (https://github.com/apache/spark/pull/) removed the pointer > array tracking because `TungstenAggregate` would fail under memory pressure. > However, this is somewhat of a hack that we should fix in the right way in > 1.6.0 to ensure we don't OOM because of this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11151) Use Long internally for DecimalType with precision <= 18
[ https://issues.apache.org/jira/browse/SPARK-11151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-11151: --- Target Version/s: (was: 1.6.0) > Use Long internally for DecimalType with precision <= 18 > > > Key: SPARK-11151 > URL: https://issues.apache.org/jira/browse/SPARK-11151 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu > > It's expensive to create a Decimal object for small decimals; we could use Long > directly, just as we did for Date and Timestamp. > This will involve a lot of changes, including: > 1) inbound/outbound conversion > 2) access/storage in InternalRow > 3) all the expressions that support DecimalType -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
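The reason precision 18 is the cut-off is arithmetic: the largest 18-digit unscaled value, 10^18 - 1, is below the signed 64-bit maximum of 2^63 - 1, while some 19-digit values are not. The Python sketch below illustrates the encoding with hypothetical helper names (`to_unscaled_long`, `from_unscaled_long`); it is not Spark's `Decimal` class.

```python
from decimal import Decimal

MAX_LONG_DIGITS = 18
LONG_MAX = 2**63 - 1


def to_unscaled_long(value: Decimal, scale: int) -> int:
    """Encode `value` as an integer count of 10**-scale units.

    Raises ValueError if the value has more fractional digits than `scale`
    or needs more than MAX_LONG_DIGITS digits.
    """
    scaled = value * (Decimal(10) ** scale)
    if scaled != scaled.to_integral_value():
        raise ValueError(f"{value} does not fit scale {scale}")
    unscaled = int(scaled)
    if abs(unscaled) >= 10**MAX_LONG_DIGITS:
        raise ValueError(f"{value} needs more than {MAX_LONG_DIGITS} digits")
    return unscaled


def from_unscaled_long(unscaled: int, scale: int) -> Decimal:
    """Decode an unscaled integer back into a Decimal with the given scale."""
    return Decimal(unscaled).scaleb(-scale)
```

With this representation, addition and comparison of two values of the same scale reduce to plain integer operations, which is the saving the issue is after.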
[jira] [Closed] (SPARK-6728) Improve performance of py4j for large bytearray
[ https://issues.apache.org/jira/browse/SPARK-6728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu closed SPARK-6728. - Resolution: Won't Fix This cannot be fixed without changes in Py4j, and that is not on the Py4j roadmap yet. > Improve performance of py4j for large bytearray > --- > > Key: SPARK-6728 > URL: https://issues.apache.org/jira/browse/SPARK-6728 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 1.3.0 >Reporter: Davies Liu >Priority: Critical > > PySpark relies on py4j to transfer function arguments and return values between > Python and the JVM; it's very slow to pass a large bytearray (larger than 10M). > In MLlib, it's possible to have a Vector with more than 100M bytes, which > will need a few GB of memory and may crash. > The reason is that py4j uses a text protocol: it encodes the bytearray as > base64 and does multiple string concatenations. > A binary protocol would help a lot; I created an issue for py4j: > https://github.com/bartdag/py4j/issues/159 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
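Part of the overhead mentioned in the issue is inherent to base64: every 3 input bytes become 4 output characters, so the encoded text is about a third larger than the payload before any string copies are made. The short Python sketch below only measures that size expansion with the standard `base64` module; it does not use or model py4j, and `base64_encoded_size` is a helper name chosen here.

```python
import base64


def base64_encoded_size(n_bytes: int) -> int:
    """Length in characters of the padded base64 encoding of n_bytes bytes."""
    return 4 * ((n_bytes + 2) // 3)


def measured_encoded_size(payload: bytes) -> int:
    """Length of the actual base64 encoding, for comparison with the formula."""
    return len(base64.b64encode(payload))
```

Each additional concatenation of the encoded string copies that enlarged text again, which is consistent with the memory blow-up described in the issue for very large vectors.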
[jira] [Commented] (SPARK-11089) Add a option for thrift-server to share a single session across all connections
[ https://issues.apache.org/jira/browse/SPARK-11089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997042#comment-14997042 ] Davies Liu commented on SPARK-11089: [~lian cheng] Could you help to look into this one? This is the backward compatible mode for 1.6, or the UDF may not work across connections (for the applications that use short connection, one connection per request). > Add a option for thrift-server to share a single session across all > connections > --- > > Key: SPARK-11089 > URL: https://issues.apache.org/jira/browse/SPARK-11089 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Davies Liu >Assignee: Cheng Lian > > In 1.6, we improve the session support in JDBC server by separating temporary > tables and UDFs. In some cases, user may still want to share the temporary > tables or UDFs across different applications. > We should have an option or config to support that (use the original > SQLContext instead of calling newSession if it's set to true). > cc [~marmbrus] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-10425) Add a regression test for SPARK-10379
[ https://issues.apache.org/jira/browse/SPARK-10425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu closed SPARK-10425. -- Resolution: Won't Fix Can't reproduce the failure (easily) > Add a regression test for SPARK-10379 > - > > Key: SPARK-10425 > URL: https://issues.apache.org/jira/browse/SPARK-10425 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Davies Liu > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-10424) ShuffleHashOuterJoin should consider condition
[ https://issues.apache.org/jira/browse/SPARK-10424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu closed SPARK-10424. -- Resolution: Invalid ShuffleHashOuterJoin actually supports conditions > ShuffleHashOuterJoin should consider condition > -- > > Key: SPARK-10424 > URL: https://issues.apache.org/jira/browse/SPARK-10424 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Davies Liu >Priority: Blocker > > Currently, ShuffleHashOuterJoin does not consider condition -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10565) New /api/v1/[path] APIs don't contain as much information as original /json API
[ https://issues.apache.org/jira/browse/SPARK-10565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid updated SPARK-10565: - Assignee: Charles Yeh > New /api/v1/[path] APIs don't contain as much information as original /json > API > > > Key: SPARK-10565 > URL: https://issues.apache.org/jira/browse/SPARK-10565 > Project: Spark > Issue Type: Improvement > Components: Input/Output, Java API >Affects Versions: 1.5.0 >Reporter: Kevin Chen >Assignee: Charles Yeh > Fix For: 1.6.0 > > Original Estimate: 1h > Remaining Estimate: 1h > > [SPARK-3454] introduced official json APIs at /api/v1/[path] for data that > originally appeared only on the web UI. However, it does not expose all the > information on the web UI or on the previous unofficial endpoint at /json. > For example, the APIs at /api/v1/[path] do not show the number of cores or > amount of memory per slave for each job. This is stored in > ApplicationInfo.desc.maxCores and ApplicationInfo.desc.memoryPerSlave, > respectively. This information would be useful to expose. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11587) SparkR can not use summary.glm from base R
[ https://issues.apache.org/jira/browse/SPARK-11587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997035#comment-14997035 ] Shivaram Venkataraman commented on SPARK-11587: --- cc [~sunrui] We should fix this - We have a number of other DataFrame methods which override things correctly. > SparkR can not use summary.glm from base R > -- > > Key: SPARK-11587 > URL: https://issues.apache.org/jira/browse/SPARK-11587 > Project: Spark > Issue Type: Bug > Components: ML, R, SparkR >Reporter: Yanbo Liang > > When we use the summary method of base R (not the SparkR method) in the SparkR console: > {code} > model <- glm(Sepal.Width ~ Sepal.Length + Species, data = iris) > summary(model) > {code} > It returns > {code} > Error in (function (classes, fdef, mtable) : > unable to find an inherited method for function ‘summary’ for signature > ‘"glm"’ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11587) SparkR can not use summary.glm from base R
[ https://issues.apache.org/jira/browse/SPARK-11587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-11587: -- Component/s: SparkR > SparkR can not use summary.glm from base R > -- > > Key: SPARK-11587 > URL: https://issues.apache.org/jira/browse/SPARK-11587 > Project: Spark > Issue Type: Bug > Components: ML, R, SparkR >Reporter: Yanbo Liang > > When we use the summary method of base R (not the SparkR method) in the SparkR console: > {code} > model <- glm(Sepal.Width ~ Sepal.Length + Species, data = iris) > summary(model) > {code} > It returns > {code} > Error in (function (classes, fdef, mtable) : > unable to find an inherited method for function ‘summary’ for signature > ‘"glm"’ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11089) Add a option for thrift-server to share a single session across all connections
[ https://issues.apache.org/jira/browse/SPARK-11089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reassigned SPARK-11089: -- Assignee: Davies Liu > Add a option for thrift-server to share a single session across all > connections > --- > > Key: SPARK-11089 > URL: https://issues.apache.org/jira/browse/SPARK-11089 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > > In 1.6, we improve the session support in JDBC server by separating temporary > tables and UDFs. In some cases, user may still want to share the temporary > tables or UDFs across different applications. > We should have an option or config to support that (use the original > SQLContext instead of calling newSession if it's set to true). > cc [~marmbrus] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
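The option requested in this issue amounts to one branch at connection time: reuse the original context when sharing is enabled, otherwise create a fresh session. The Python sketch below models only that decision. `Context`, `new_session`, `context_for_connection` and the `single_session` flag are illustrative stand-ins, not Spark's `SQLContext` API or an actual configuration key.

```python
from typing import Dict


class Context:
    """Hypothetical stand-in for a SQL context holding session-scoped state."""

    def __init__(self) -> None:
        self.temp_functions: Dict[str, str] = {}

    def new_session(self) -> "Context":
        # A new session starts with its own, empty set of temporary functions.
        return Context()


def context_for_connection(root: Context, single_session: bool) -> Context:
    """Pick the context a new connection should use."""
    return root if single_session else root.new_session()
```

With sharing enabled, a UDF registered on one short-lived connection stays visible to the next one, which is the backward-compatible behaviour the comment on this issue asks for.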
[jira] [Commented] (SPARK-11560) Optimize KMeans implementation
[ https://issues.apache.org/jira/browse/SPARK-11560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997103#comment-14997103 ] Joseph K. Bradley commented on SPARK-11560: --- Do we want to keep the implementation for the Pipelines API? We had worked on stacking models for linear methods (to do many runs at once) to amortize overhead, and this is the same kind of effort. It should be helpful in some problem domains. Has there been evidence that it's rarely useful? > Optimize KMeans implementation > -- > > Key: SPARK-11560 > URL: https://issues.apache.org/jira/browse/SPARK-11560 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.7.0 >Reporter: Xiangrui Meng > > After we dropped `runs`, we can simplify and optimize the k-means > implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11597) improve performance of array and map encoder
Wenchen Fan created SPARK-11597: --- Summary: improve performance of array and map encoder Key: SPARK-11597 URL: https://issues.apache.org/jira/browse/SPARK-11597 Project: Spark Issue Type: Improvement Components: SQL Reporter: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11475) DataFrame API saveAsTable() does not work well for HDFS HA
[ https://issues.apache.org/jira/browse/SPARK-11475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997119#comment-14997119 ] Rekha Joshi commented on SPARK-11475: - Glad, thanks for confirming [~zhangxiongfei] > DataFrame API saveAsTable() does not work well for HDFS HA > -- > > Key: SPARK-11475 > URL: https://issues.apache.org/jira/browse/SPARK-11475 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 > Environment: Hadoop 2.4 & Spark 1.5.1 >Reporter: zhangxiongfei > Attachments: dataFrame_saveAsTable.txt, hdfs-site.xml, hive-site.xml > > > I was trying to save a DF to Hive using the following code: > {quote} > sqlContext.range(1L,1000L,2L,2).coalesce(1).saveAsTable("dataframeTable") > {quote} > But I got the exception below: > {quote} > warning: there were 1 deprecation warning(s); re-run with -deprecation for > details > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): > Operation category READ is not supported in state standby > at > org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:87) > at > org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1610) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1193) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:3516) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:785) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo( > {quote} > *My Hive configuration is*: > {quote} > > hive.metastore.warehouse.dir > */apps/hive/warehouse* > > {quote} > It seems that HDFS HA is not configured, so I then tried the code below: > {quote} > sqlContext.range(1L,1000L,2L,2).coalesce(1).saveAsParquetFile("hdfs://bitautodmp/apps/hive/warehouse/dataframeTable") > {quote} > I could verify that the API *saveAsParquetFile* worked well by running the following > command: > {quote} > *hadoop fs -ls /apps/hive/warehouse/dataframeTable* > Found 4 items > -rw-r--r-- 3 zhangxf hdfs 0 2015-11-03 17:57 > */apps/hive/warehouse/dataframeTable/_SUCCESS* > -rw-r--r-- 3 zhangxf hdfs 199 2015-11-03 17:57 > */apps/hive/warehouse/dataframeTable/_common_metadata* > -rw-r--r-- 3 zhangxf hdfs 325 2015-11-03 17:57 > */apps/hive/warehouse/dataframeTable/_metadata* > -rw-r--r-- 3 zhangxf hdfs 1098 2015-11-03 17:57 > */apps/hive/warehouse/dataframeTable/part-r-0-a05a9bf3-b2a6-40e5-b180-818efb2a0f54.gz.parquet* > {quote}
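The two paths in the report above differ in one respect that may matter here: the warehouse directory is given without a scheme or authority, while the path passed to saveAsParquetFile names both. A path without a scheme is resolved against whatever default filesystem the client is configured with. The sketch below only illustrates that difference with java.net.URI; it does not touch Hadoop or Spark. The object and method names are hypothetical, and reading "bitautodmp" as the HA nameservice is an assumption based on the path in the report.

```scala
import java.net.URI

// Hypothetical helper, not part of Spark or Hadoop.
object WarehousePathCheck {
  // True when the path names its filesystem explicitly,
  // i.e. both a scheme and an authority are present.
  def isFullyQualified(path: String): Boolean = {
    val uri = new URI(path)
    uri.getScheme != null && uri.getAuthority != null
  }

  def main(args: Array[String]): Unit = {
    // Value of hive.metastore.warehouse.dir from the report.
    val warehouseDir = "/apps/hive/warehouse"
    // Path that worked with saveAsParquetFile in the report;
    // "bitautodmp" is assumed to be the HA nameservice.
    val explicitPath = "hdfs://bitautodmp/apps/hive/warehouse/dataframeTable"

    for (p <- Seq(warehouseDir, explicitPath)) {
      val uri = new URI(p)
      println(s"$p -> scheme=${uri.getScheme}, authority=${uri.getAuthority}, " +
        s"qualified=${isFullyQualified(p)}")
    }
  }
}
```

Whether the missing scheme is the actual cause of the StandbyException is not established in this thread; the sketch only shows which of the two paths depends on the client's default filesystem setting.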
[jira] [Assigned] (SPARK-11598) Add tests for ShuffledHashOuterJoin
[ https://issues.apache.org/jira/browse/SPARK-11598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11598: Assignee: Davies Liu (was: Apache Spark) > Add tests for ShuffledHashOuterJoin > --- > > Key: SPARK-11598 > URL: https://issues.apache.org/jira/browse/SPARK-11598 > Project: Spark > Issue Type: Test >Reporter: Davies Liu >Assignee: Davies Liu > > We only test the default algorithm (SortMergeOuterJoin) for outer join, > ShuffledHashOuterJoin is not well tested.
[jira] [Updated] (SPARK-11362) Use Spark BitSet in BroadcastNestedLoopJoin
[ https://issues.apache.org/jira/browse/SPARK-11362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-11362: Assignee: Liang-Chi Hsieh > Use Spark BitSet in BroadcastNestedLoopJoin > --- > > Key: SPARK-11362 > URL: https://issues.apache.org/jira/browse/SPARK-11362 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh > Fix For: 1.6.0 > > > We use scala.collection.mutable.BitSet in BroadcastNestedLoopJoin now. We > should use Spark's BitSet.
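For readers unfamiliar with why a nested loop join would need a bit set at all, here is a minimal sketch of one plausible use: recording which rows on one side found a match, so that the unmatched rows can be handled afterwards (as an outer join requires). It uses scala.collection.mutable.BitSet, the class the issue says is currently in use, not Spark's own BitSet, and it is not taken from the Spark source. The object name, method name, and the idea that this is the purpose of the bit set in BroadcastNestedLoopJoin are all assumptions made for illustration.

```scala
import scala.collection.mutable

// Hypothetical illustration, not Spark code.
object MatchedRowTracker {
  // Returns the indices of broadcast-side rows that matched no streamed row
  // under the given join condition.
  def unmatchedIndices[A, B](broadcast: IndexedSeq[A], streamed: Seq[B])(
      cond: (A, B) => Boolean): Seq[Int] = {
    // One bit per broadcast row; a set bit means "matched at least once".
    val matched = mutable.BitSet.empty
    for (s <- streamed; i <- broadcast.indices if cond(broadcast(i), s)) {
      matched += i
    }
    broadcast.indices.filterNot(matched.contains)
  }
}
```

A call such as `MatchedRowTracker.unmatchedIndices(Vector(1, 2, 3), Seq(2, 3, 4))(_ == _)` reports which positions in the first collection had no equal element in the second.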