[jira] [Commented] (SPARK-13352) BlockFetch does not scale well on large block
[ https://issues.apache.org/jira/browse/SPARK-13352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245209#comment-15245209 ]

Davies Liu commented on SPARK-13352:
------------------------------------

corrected, thanks

> BlockFetch does not scale well on large block
> ---------------------------------------------
>
> Key: SPARK-13352
> URL: https://issues.apache.org/jira/browse/SPARK-13352
> Project: Spark
> Issue Type: Bug
> Components: Block Manager, Spark Core
> Reporter: Davies Liu
> Assignee: Zhang, Liye
> Priority: Critical
> Fix For: 1.6.2, 2.0.0
>
> BlockManager.getRemoteBytes() performs poorly on large blocks:
> {code}
> test("block manager") {
>   val N = 500 << 20
>   val bm = sc.env.blockManager
>   val blockId = TaskResultBlockId(0)
>   val buffer = ByteBuffer.allocate(N)
>   buffer.limit(N)
>   bm.putBytes(blockId, buffer, StorageLevel.MEMORY_AND_DISK_SER)
>   val result = bm.getRemoteBytes(blockId)
>   assert(result.isDefined)
>   assert(result.get.limit() === (N))
> }
> {code}
> Here are the runtimes for different block sizes:
> {code}
> 50M    3 seconds
> 100M   7 seconds
> 250M   33 seconds
> 500M   2 min
> {code}

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-13352) BlockFetch does not scale well on large block
[ https://issues.apache.org/jira/browse/SPARK-13352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15234559#comment-15234559 ]

Davies Liu edited comment on SPARK-13352 at 4/18/16 6:40 AM:
-------------------------------------------------------------

The result is much better now (there is some fixed overhead for tests):
{code}
50M    2.2 seconds
100M   2.8 seconds
250M   3.7 seconds
500M   7.8 seconds
{code}

was (Author: davies):
The result is much better now (there is some fixed overhead for tests):
{code}
50M    2.2 seconds
100M   2.8 seconds
250M   3.7 seconds
500M   7.8 min
{code}
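The timings above went from roughly quadratic to roughly linear in block size, which is consistent with streaming a large block in bounded chunks rather than materializing it in one buffer per fetch. A minimal Python sketch of the chunked-read idea (illustrative only, not Spark's Netty-based transfer code; `CHUNK_SIZE` is an invented parameter):

```python
import io

CHUNK_SIZE = 4 * 1024 * 1024  # hypothetical 4 MB frame size

def fetch_in_chunks(remote: io.BufferedIOBase, size: int) -> bytes:
    """Read `size` bytes from `remote` in CHUNK_SIZE pieces,
    bounding the memory touched by each read call."""
    out = bytearray()
    remaining = size
    while remaining > 0:
        chunk = remote.read(min(CHUNK_SIZE, remaining))
        if not chunk:
            raise IOError("remote stream ended early")
        out.extend(chunk)
        remaining -= len(chunk)
    return bytes(out)

# Simulate fetching a 16 MB "remote" block.
block = bytes(16 * 1024 * 1024)
fetched = fetch_in_chunks(io.BytesIO(block), len(block))
assert fetched == block
```

Each read call touches at most `CHUNK_SIZE` bytes, so the per-transfer allocation no longer grows with the block size.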
[jira] [Commented] (SPARK-14696) Needs implicit encoders for boxed primitive types
[ https://issues.apache.org/jira/browse/SPARK-14696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245204#comment-15245204 ]

Apache Spark commented on SPARK-14696:
--------------------------------------

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/12466

> Needs implicit encoders for boxed primitive types
> -------------------------------------------------
>
> Key: SPARK-14696
> URL: https://issues.apache.org/jira/browse/SPARK-14696
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Reporter: Reynold Xin
> Assignee: Reynold Xin
>
> We currently only have implicit encoders for Scala primitive types. We should
> also add implicit encoders for boxed primitives. Otherwise, the following
> code would not have an encoder:
> {code}
> sqlContext.range(1000).map { i => i }
> {code}
[jira] [Assigned] (SPARK-14696) Needs implicit encoders for boxed primitive types
[ https://issues.apache.org/jira/browse/SPARK-14696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-14696:
------------------------------------

Assignee: Reynold Xin  (was: Apache Spark)
[jira] [Assigned] (SPARK-14696) Needs implicit encoders for boxed primitive types
[ https://issues.apache.org/jira/browse/SPARK-14696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-14696:
------------------------------------

Assignee: Apache Spark  (was: Reynold Xin)
[jira] [Created] (SPARK-14696) Needs implicit encoders for boxed primitive types
Reynold Xin created SPARK-14696:
-----------------------------------

Summary: Needs implicit encoders for boxed primitive types
Key: SPARK-14696
URL: https://issues.apache.org/jira/browse/SPARK-14696
Project: Spark
Issue Type: Improvement
Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin

We currently only have implicit encoders for Scala primitive types. We should also add implicit encoders for boxed primitives. Otherwise, the following code would not have an encoder:
{code}
sqlContext.range(1000).map { i => i }
{code}
[jira] [Commented] (SPARK-14453) Consider removing SPARK_JAVA_OPTS env variable
[ https://issues.apache.org/jira/browse/SPARK-14453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245169#comment-15245169 ]

Apache Spark commented on SPARK-14453:
--------------------------------------

User 'jerryshao' has created a pull request for this issue:
https://github.com/apache/spark/pull/12465

> Consider removing SPARK_JAVA_OPTS env variable
> ----------------------------------------------
>
> Key: SPARK-14453
> URL: https://issues.apache.org/jira/browse/SPARK-14453
> Project: Spark
> Issue Type: Bug
> Components: Spark Core, YARN
> Reporter: Saisai Shao
> Priority: Minor
>
> SPARK_JAVA_OPTS has been deprecated since 1.0; with the release of a major
> version (2.0), I think it would be better to remove support for this env
> variable.
[jira] [Assigned] (SPARK-14453) Consider removing SPARK_JAVA_OPTS env variable
[ https://issues.apache.org/jira/browse/SPARK-14453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-14453:
------------------------------------

Assignee: Apache Spark
[jira] [Assigned] (SPARK-14453) Consider removing SPARK_JAVA_OPTS env variable
[ https://issues.apache.org/jira/browse/SPARK-14453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-14453:
------------------------------------

Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-12810) PySpark CrossValidatorModel should support avgMetrics
[ https://issues.apache.org/jira/browse/SPARK-12810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245152#comment-15245152 ]

Apache Spark commented on SPARK-12810:
--------------------------------------

User 'vectorijk' has created a pull request for this issue:
https://github.com/apache/spark/pull/12464

> PySpark CrossValidatorModel should support avgMetrics
> -----------------------------------------------------
>
> Key: SPARK-12810
> URL: https://issues.apache.org/jira/browse/SPARK-12810
> Project: Spark
> Issue Type: Improvement
> Components: ML, PySpark
> Reporter: Feynman Liang
> Labels: starter
>
> The {{CrossValidator}} in Scala has supported {{avgMetrics}} since 1.5.0, which
> allows the user to evaluate how well each {{ParamMap}} in the grid search
> performed and identify the best parameters. We should support this in PySpark
> as well.
[jira] [Assigned] (SPARK-12810) PySpark CrossValidatorModel should support avgMetrics
[ https://issues.apache.org/jira/browse/SPARK-12810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-12810:
------------------------------------

Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-12810) PySpark CrossValidatorModel should support avgMetrics
[ https://issues.apache.org/jira/browse/SPARK-12810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-12810:
------------------------------------

Assignee: Apache Spark
[jira] [Commented] (SPARK-13662) [SQL][Hive] Have SHOW TABLES return additional fields from Hive MetaStore
[ https://issues.apache.org/jira/browse/SPARK-13662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245129#comment-15245129 ]

Evan Chan commented on SPARK-13662:
-----------------------------------

Vijay, that would be awesome! Please go ahead.

> [SQL][Hive] Have SHOW TABLES return additional fields from Hive MetaStore
> --------------------------------------------------------------------------
>
> Key: SPARK-13662
> URL: https://issues.apache.org/jira/browse/SPARK-13662
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 1.5.2, 1.6.0
> Environment: All
> Reporter: Evan Chan
>
> Currently, the SHOW TABLES command in Spark's Hive ThriftServer, or
> equivalently the HiveContext.tables method, returns a DataFrame with only two
> columns: the name of the table and whether it is temporary. It would be
> really nice to add support for returning some extra information, such as:
> - Whether this table is Spark-only or a native Hive table
> - If Spark-only, the name of the data source
> - Potentially other properties
> The first two are really useful for BI environments that connect to multiple
> data sources and work with both Hive and Spark.
> Some thoughts:
> - The SQL/HiveContext Catalog API might need to be expanded to return
> something like a TableEntry, rather than just a tuple of (name, temporary).
> - I believe there is a Hive Catalog/client API to get information about each
> table. One concern would be the speed of using this API. Perhaps there are
> other APIs that can get this info faster.
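The "TableEntry, rather than just a tuple of (name, temporary)" idea maps naturally onto a small record type. A hedged sketch of what such an entry could carry (field names here are invented for illustration; this is not an actual Spark Catalog API):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class TableEntry:
    """Hypothetical richer SHOW TABLES row, per the proposal above."""
    name: str
    is_temporary: bool
    is_spark_only: bool                # Spark data-source table vs. native Hive table
    data_source: Optional[str] = None  # e.g. "parquet"; only set for Spark-only tables

entries = [
    TableEntry("events", is_temporary=False, is_spark_only=True, data_source="parquet"),
    TableEntry("hive_logs", is_temporary=False, is_spark_only=False),
]

# A BI client could now distinguish Spark-managed tables from native Hive ones.
spark_only = [e.name for e in entries if e.is_spark_only]
```

The extra fields are additive, so a two-column consumer that only reads `name` and `is_temporary` keeps working.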
[jira] [Commented] (SPARK-13904) Add support for pluggable cluster manager
[ https://issues.apache.org/jira/browse/SPARK-13904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245118#comment-15245118 ]

Hemant Bhanawat commented on SPARK-13904:
-----------------------------------------

[~kiszk] I am looking into this.

> Add support for pluggable cluster manager
> -----------------------------------------
>
> Key: SPARK-13904
> URL: https://issues.apache.org/jira/browse/SPARK-13904
> Project: Spark
> Issue Type: Improvement
> Components: Scheduler
> Reporter: Hemant Bhanawat
>
> Currently Spark allows only a few cluster managers, viz. YARN, Mesos and
> Standalone. But as Spark is now being used in newer and different use cases,
> there is a need to allow other cluster managers to manage Spark components.
> One such use case is embedding Spark components like the executor and driver
> inside another process, which may be a datastore. This allows co-location of
> data and processing. Another requirement stemming from such a use case is
> that the executors/driver should not take the parent process down when they
> go down, and that the components can be relaunched inside the same process
> again.
> So, this JIRA requests two functionalities:
> 1. Support for external cluster managers
> 2. Allow a cluster manager to clean up the tasks without taking the parent
> process down.
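The two requested functionalities suggest a small contract: a way for a manager to claim a master URL, and a teardown path that does not kill the host process. A rough Python sketch of such an interface (all names invented for illustration; Spark's actual mechanism for this issue is a Scala-side pluggable cluster-manager trait, not this code):

```python
from abc import ABC, abstractmethod

class ClusterManager(ABC):
    """Hypothetical pluggable cluster-manager contract."""

    @abstractmethod
    def can_create(self, master_url: str) -> bool:
        """Return True if this manager handles the given master URL."""

    @abstractmethod
    def launch_executor(self, executor_id: str) -> None:
        """Start (or embed) an executor."""

    @abstractmethod
    def shutdown(self) -> None:
        """Clean up tasks WITHOUT taking the parent process down."""

class EmbeddedClusterManager(ClusterManager):
    """Executors live inside the parent process, e.g. co-located with a datastore."""

    def __init__(self) -> None:
        self.executors: list[str] = []

    def can_create(self, master_url: str) -> bool:
        return master_url.startswith("embedded://")

    def launch_executor(self, executor_id: str) -> None:
        self.executors.append(executor_id)

    def shutdown(self) -> None:
        # Drop executors but keep the process alive so they can be relaunched.
        self.executors.clear()

# Registry-style lookup: the first manager that claims the master URL wins.
managers = [EmbeddedClusterManager()]
chosen = next(m for m in managers if m.can_create("embedded://localhost"))
chosen.launch_executor("exec-1")
```

Dispatching on the master URL is what lets new managers plug in without changes to the core scheduler.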
[jira] [Commented] (SPARK-14647) Group SQLContext/HiveContext state into PersistentState
[ https://issues.apache.org/jira/browse/SPARK-14647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245109#comment-15245109 ]

Apache Spark commented on SPARK-14647:
--------------------------------------

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/12463

> Group SQLContext/HiveContext state into PersistentState
> -------------------------------------------------------
>
> Key: SPARK-14647
> URL: https://issues.apache.org/jira/browse/SPARK-14647
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: Andrew Or
> Assignee: Andrew Or
> Fix For: 2.0.0
>
> This is analogous to SPARK-13526, which moved some things into
> `SessionState`. After this issue we'll have an analogous `PersistentState`
> that groups things to be shared across sessions. This will simplify the
> constructors of the contexts significantly by allowing us to pass fewer
> things into the contexts.
[jira] [Assigned] (SPARK-14647) Group SQLContext/HiveContext state into PersistentState
[ https://issues.apache.org/jira/browse/SPARK-14647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-14647:
------------------------------------

Assignee: Andrew Or  (was: Apache Spark)
[jira] [Assigned] (SPARK-14647) Group SQLContext/HiveContext state into PersistentState
[ https://issues.apache.org/jira/browse/SPARK-14647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-14647:
------------------------------------

Assignee: Apache Spark  (was: Andrew Or)
[jira] [Commented] (SPARK-14647) Group SQLContext/HiveContext state into PersistentState
[ https://issues.apache.org/jira/browse/SPARK-14647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245100#comment-15245100 ]

Yin Huai commented on SPARK-14647:
----------------------------------

Seems that the test still timed out after we reverted the commit:
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/628/testReport/org.apache.spark.sql.hive/HiveSparkSubmitSuite/SPARK_9757_Persist_Parquet_relation_with_decimal_column/
[jira] [Updated] (SPARK-14695) Error occurs when using OFF_HEAP persistent level
[ https://issues.apache.org/jira/browse/SPARK-14695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Liang Lee updated SPARK-14695:
------------------------------

Description:
When running a PageRank job through the default examples, e.g., the class 'org.apache.spark.examples.graphx.Analytics' in the spark-examples-1.6.0-hadoop2.6.0.jar package, we got the following errors:

16/04/18 03:16:55 INFO scheduler.TaskSetManager: Finished task 2.0 in stage 9.0 (TID 66) in 1662 ms on R1S1 (1/10)
16/04/18 03:16:55 INFO scheduler.TaskSetManager: Finished task 8.0 in stage 9.0 (TID 73) in 1663 ms on R1S1 (2/10)
16/04/18 03:16:55 INFO scheduler.TaskSetManager: Finished task 5.0 in stage 9.0 (TID 70) in 1672 ms on R1S1 (3/10)
16/04/18 03:16:55 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 9.0 (TID 69) in 1680 ms on R1S1 (4/10)
16/04/18 03:16:55 INFO scheduler.TaskSetManager: Finished task 7.0 in stage 9.0 (TID 72) in 1678 ms on R1S1 (5/10)
16/04/18 03:16:55 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 9.0 (TID 67) in 1682 ms on R1S1 (6/10)
16/04/18 03:16:55 INFO scheduler.TaskSetManager: Finished task 4.0 in stage 9.0 (TID 75) in 1710 ms on R1S1 (7/10)
16/04/18 03:16:55 INFO scheduler.TaskSetManager: Finished task 3.0 in stage 9.0 (TID 74) in 1729 ms on R1S1 (8/10)
16/04/18 03:16:55 INFO scheduler.TaskSetManager: Finished task 9.0 in stage 9.0 (TID 68) in 1838 ms on R1S1 (9/10)
16/04/18 03:17:25 WARN scheduler.TaskSetManager: Lost task 6.0 in stage 9.0 (TID 71, R1S1): java.lang.IllegalArgumentException: requirement failed: sizeInBytes was negative: -1
    at scala.Predef$.require(Predef.scala:233)
    at org.apache.spark.storage.BlockInfo.markReady(BlockInfo.scala:55)
    at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:822)
    at org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:645)
    at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:153)
    at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
16/04/18 03:17:25 INFO scheduler.TaskSetManager: Starting task 6.1 in stage 9.0 (TID 76, R1S1, partition 6,PROCESS_LOCAL, 2171 bytes)
16/04/18 03:17:25 DEBUG hdfs.DFSClient: DataStreamer block BP-1194875811-10.3.1.3-1460617951862:blk_1073742842_2018 sending packet packet seqno:-1 offsetInBlock:0 lastPacketInBlock:false lastByteOffsetInBlock: 0
16/04/18 03:17:25 DEBUG hdfs.DFSClient: DFSClient seqno: -1 status: SUCCESS status: SUCCESS downstreamAckTimeNanos: 653735
16/04/18 03:17:25 WARN scheduler.TaskSetManager: Lost task 6.1 in stage 9.0 (TID 76, R1S1): org.apache.spark.storage.BlockException: Block manager failed to return cached value for rdd_28_6!
    at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:158)
    at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

We use the following script to submit the job:
/Hadoop/spark-1.6.0-bin-hadoop2.6/bin/spark-submit --class org.apache.spark.examples.graphx.Analytics /Hadoop/spark-1.6.0-bin-hadoop2.6/lib/spark-examples-1.6.0-hadoop2.6.0.jar pagerank /data/soc-LiveJournal1.txt --output=/output/live-off.res --numEPart=10 --numIter=1 --edgeStorageLevel=OFF_HEAP --vertexStorageLevel=OFF_HEAP

When we set the storage level to MEMORY_ONLY or DISK_ONLY, there is no error and the job finishes correctly. But when we set the storage level to OFF_HEAP, the error occurs.
[jira] [Comment Edited] (SPARK-14409) Investigate adding a RankingEvaluator to ML
[ https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245084#comment-15245084 ]

Yong Tang edited comment on SPARK-14409 at 4/18/16 3:12 AM:
------------------------------------------------------------

Thanks [~mlnick] for the references. I will take a look at those and see what we could do with them. By the way, I initially thought I could easily call RankingMetrics in mllib.evaluation from the new ml.evaluation.RankingEvaluator. However, I am having some trouble in the implementation because the
{code}
@Since("2.0.0")
override def evaluate(dataset: Dataset[_]): Double
{code}
in `RankingEvaluator` is not so easy to convert into RankingMetrics's `RDD[(Array[T], Array[T])]`. I will do some further investigation. If I cannot find an easy way to convert the dataset into a generic `RDD[(Array[T], Array[T])]`, I will implement the methods directly in the new ml.evaluation (instead of calling mllib.evaluation).

was (Author: yongtang):
Thanks [~mlnick] for the references. I will take a look at those and see what we could do with them. By the way, I initially thought I could easily call RankingMetrics in mllib.evaluation from the new ml.evaluation.RankingEvaluator. However, I am having some trouble in the implementation because the `@Since("2.0.0") override def evaluate(dataset: Dataset[_]): Double` in `RankingEvaluator` is not so easy to convert into RankingMetrics's `RDD[(Array[T], Array[T])]`. I will do some further investigation. If I cannot find an easy way to convert the dataset into a generic `RDD[(Array[T], Array[T])]`, I will implement the methods directly in the new ml.evaluation (instead of calling mllib.evaluation).
> Investigate adding a RankingEvaluator to ML
> -------------------------------------------
>
> Key: SPARK-14409
> URL: https://issues.apache.org/jira/browse/SPARK-14409
> Project: Spark
> Issue Type: New Feature
> Components: ML
> Reporter: Nick Pentreath
> Priority: Minor
>
> {{mllib.evaluation}} contains a {{RankingMetrics}} class, while there is no
> {{RankingEvaluator}} in {{ml.evaluation}}. Such an evaluator can be useful
> for recommendation evaluation (and can be useful in other settings
> potentially).
> Should be thought about in conjunction with adding the "recommendAll" methods
> in SPARK-13857, so that top-k ranking metrics can be used in cross-validators.
[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML
[ https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245084#comment-15245084 ]

Yong Tang commented on SPARK-14409:
-----------------------------------

Thanks [~mlnick] for the references. I will take a look at those and see what we could do with them. By the way, I initially thought I could easily call RankingMetrics in mllib.evaluation from the new ml.evaluation.RankingEvaluator. However, I am having some trouble in the implementation because the `@Since("2.0.0") override def evaluate(dataset: Dataset[_]): Double` in `RankingEvaluator` is not so easy to convert into RankingMetrics's `RDD[(Array[T], Array[T])]`. I will do some further investigation. If I cannot find an easy way to convert the dataset into a generic `RDD[(Array[T], Array[T])]`, I will implement the methods directly in the new ml.evaluation (instead of calling mllib.evaluation).
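RankingMetrics consumes pairs of (predicted, ground-truth) rankings, so whichever conversion path is chosen, the underlying metric is simple. As a reference point, a plain-Python sketch of precision@k over one such pair (illustrative only; Spark's RankingMetrics has its own edge-case handling):

```python
def precision_at_k(predicted, actual, k):
    """Fraction of the top-k predicted items found in the ground-truth set."""
    if k <= 0:
        raise ValueError("k must be positive")
    actual_set = set(actual)
    hits = sum(1 for item in predicted[:k] if item in actual_set)
    return hits / k

# One (predicted, actual) pair, mirroring an element of RDD[(Array[T], Array[T])].
pred, truth = [1, 6, 2, 7, 8], [1, 2, 3, 4, 5]
p_at_5 = precision_at_k(pred, truth, 5)  # items 1 and 2 hit: 2/5 = 0.4
```

An `ml.evaluation` evaluator would compute this per row of the input Dataset and average the results across rows.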
[jira] [Commented] (SPARK-14695) Error occurs when using OFF_HEAP persistent level
[ https://issues.apache.org/jira/browse/SPARK-14695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245083#comment-15245083 ] Liang Lee commented on SPARK-14695: --- The cluster enviroment is like this: Totally 3 servers. One acts as NameNode, SparkMaster and TachyonMaster; the other two act as DataNode, Spark Worker and Tachyon Worker. We set the Tachyon Worker Memory to 64GB per node and Total 128GB, only memory level is enabled in Tachyon. We submit the job on Master node. The most strange question is: In the same worker, most executors can finish the task correctly but only 1 or 2 executors failed to cache the block and cause the above errors. > Error occurs when using OFF_HEAP persistent level > -- > > Key: SPARK-14695 > URL: https://issues.apache.org/jira/browse/SPARK-14695 > Project: Spark > Issue Type: Bug > Components: Block Manager, Spark Core >Affects Versions: 1.6.0 > Environment: Spark 1.6.0 > Tachyon 0.8.2 > Hadoop 2.6.0 >Reporter: Liang Lee > > When running a PageRank job through the default examples, e.g., the class > 'org.apache.spark.examples.graphx.Analytics' in > spark-examples-1.6.0-hadoop2.6.0.jar package, we got the following erors: > 16/04/18 02:30:01 WARN scheduler.TaskSetManager: Lost task 9.0 in stage 6.0 > (TID > 53, R1S1): java.lang.IllegalArgumentException: requirement failed: > sizeInBytes > was negative: -1 > at scala.Predef$.require(Predef.scala:233) > at org.apache.spark.storage.BlockInfo.markReady(BlockInfo.scala:55) > at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:822) > at > org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala: > 645) > at > org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:15 > 3) > at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:268) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala: > 38) > at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scal > a:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scal > a:41) > at org.apache.spark.scheduler.Task.run(Task.scala:89) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor. > java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor > .java:615) > at java.lang.Thread.run(Thread.java:745) > We use the following script to submit the job: > /Hadoop/spark-1.6.0-bin-hadoop2.6/bin/spark-submit --class > org.apache.spark.examples.graphx.Analytics > /Hadoop/spark-1.6.0-bin-hadoop2.6/lib/spark-examples-1.6.0-hadoop2.6.0.jar > pagerank /data/soc-LiveJournal1.txt --output=/output/live-off.res > --numEPart=10 --numIter=1 --edgeStorageLevel=OFF_HEAP > --vertexStorageLevel=OFF_HEAP > When we set the storage level to MEMORY_ONLY or DISK_ONLY, there is no error > and the job can finished correctly. > But when we set the storage level to OFF_HEAP, which means using Tachyon for > the storage process, the error occurs. > The executors stack is like this, seems the write block to Tahcyon failed. 
> 16/04/18 02:25:54 ERROR ExternalBlockStore: Error in putValues(rdd_20_1) > java.io.IOException: Fail to cache: null > at > tachyon.client.file.FileOutStream.handleCacheWriteException(FileOutStream.java:276) > at tachyon.client.file.FileOutStream.close(FileOutStream.java:165) > at > org.apache.spark.storage.TachyonBlockManager.putValues(TachyonBlockManager.scala:126) > at > org.apache.spark.storage.ExternalBlockStore.putIntoExternalBlockStore(ExternalBlockStore.scala:79) > at > org.apache.spark.storage.ExternalBlockStore.putIterator(ExternalBlockStore.scala:67) > at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:798) > at > org.apac
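The report above boils down to the same persist call behaving differently per storage level. The sketch below is a hypothetical minimal repro, not the reporter's actual job: the object name, RDD contents, and app name are made up, and it assumes a Spark 1.6.x deployment where the external block store (Tachyon) is configured. Per the report, the first two levels succeed and OFF_HEAP is where the failure appears.

{code:borderStyle=solid}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Hypothetical repro sketch: persist the same RDD under the three
// storage levels mentioned in the report. In Spark 1.6.x, OFF_HEAP
// routes blocks through the external block store (Tachyon).
object OffHeapRepro {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("offheap-repro"))
    val data = sc.parallelize(1 to 1000000, numSlices = 10)

    for (level <- Seq(StorageLevel.MEMORY_ONLY, StorageLevel.DISK_ONLY, StorageLevel.OFF_HEAP)) {
      data.persist(level)
      // Per the report: MEMORY_ONLY and DISK_ONLY count fine; OFF_HEAP
      // fails with "requirement failed: sizeInBytes was negative: -1".
      println(s"$level -> ${data.count()}")
      data.unpersist(blocking = true)
    }
    sc.stop()
  }
}
{code}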
[jira] [Commented] (SPARK-14628) Remove all the Options in TaskMetrics
[ https://issues.apache.org/jira/browse/SPARK-14628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245079#comment-15245079 ] Apache Spark commented on SPARK-14628: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/12462 > Remove all the Options in TaskMetrics > - > > Key: SPARK-14628 > URL: https://issues.apache.org/jira/browse/SPARK-14628 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Reynold Xin >Assignee: Wenchen Fan > Fix For: 2.0.0 > > > Part of the reason why TaskMetrics and its callers are complicated is due to > the optional metrics we collect, including input, output, shuffle read, and > shuffle write. Since their default values are zero, I think we can always track them. > It is usually very obvious whether a task is supposed to read any data or > not. By always tracking them, we can remove a lot of map, foreach, flatMap, > getOrElse(0L) calls throughout Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
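The refactoring described in the ticket can be sketched in isolation. The classes below are hypothetical stand-ins, not Spark's actual TaskMetrics API; they only illustrate why zero-defaulted fields remove the `map`/`getOrElse(0L)` plumbing at every call site.

{code:borderStyle=solid}
// Hypothetical before/after sketch -- not the real TaskMetrics classes.
case class ShuffleReadOld(remoteBytesRead: Option[Long])   // optional metric
case class ShuffleReadNew(remoteBytesRead: Long = 0L)      // always tracked, zero-initialized

object MetricsSketch {
  def main(args: Array[String]): Unit = {
    // Before: every reader has to unwrap the Option.
    val oldMetrics = ShuffleReadOld(None)
    val bytesBefore = oldMetrics.remoteBytesRead.getOrElse(0L)

    // After: the field is always present; zero simply means "no shuffle read".
    val newMetrics = ShuffleReadNew()
    val bytesAfter = newMetrics.remoteBytesRead

    assert(bytesBefore == bytesAfter)   // both report 0 bytes read
  }
}
{code}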
[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML
[ https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245077#comment-15245077 ] Apache Spark commented on SPARK-14409: -- User 'yongtang' has created a pull request for this issue: https://github.com/apache/spark/pull/12461 > Investigate adding a RankingEvaluator to ML > --- > > Key: SPARK-14409 > URL: https://issues.apache.org/jira/browse/SPARK-14409 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Nick Pentreath >Priority: Minor > > {{mllib.evaluation}} contains a {{RankingMetrics}} class, while there is no > {{RankingEvaluator}} in {{ml.evaluation}}. Such an evaluator can be useful > for recommendation evaluation (and can be useful in other settings > potentially). > Should be thought about in conjunction with adding the "recommendAll" methods > in SPARK-13857, so that top-k ranking metrics can be used in cross-validators. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
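For context, the RDD-based metrics class that such an evaluator would presumably wrap already exists in {{org.apache.spark.mllib.evaluation}}. A hedged usage sketch (the IDs are made up for illustration, and {{sc}} is assumed to be an active SparkContext):

{code:borderStyle=solid}
import org.apache.spark.mllib.evaluation.RankingMetrics

// Each element pairs one user's ranked recommendations with that
// user's ground-truth relevant items.
val predictionAndLabels = sc.parallelize(Seq(
  (Array(1, 2, 3, 4, 5), Array(1, 2, 5)),
  (Array(6, 7, 8, 9, 10), Array(7))
))

val metrics = new RankingMetrics(predictionAndLabels)
println(metrics.precisionAt(5))          // mean precision@5 across users
println(metrics.meanAveragePrecision)    // MAP
println(metrics.ndcgAt(5))               // NDCG@5
{code}

A DataFrame-based RankingEvaluator would expose these as a {{metricName}} param so they plug into {{CrossValidator}}, which is the motivation stated in the ticket.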
[jira] [Assigned] (SPARK-14409) Investigate adding a RankingEvaluator to ML
[ https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14409: Assignee: (was: Apache Spark) > Investigate adding a RankingEvaluator to ML > --- > > Key: SPARK-14409 > URL: https://issues.apache.org/jira/browse/SPARK-14409 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Nick Pentreath >Priority: Minor > > {{mllib.evaluation}} contains a {{RankingMetrics}} class, while there is no > {{RankingEvaluator}} in {{ml.evaluation}}. Such an evaluator can be useful > for recommendation evaluation (and can be useful in other settings > potentially). > Should be thought about in conjunction with adding the "recommendAll" methods > in SPARK-13857, so that top-k ranking metrics can be used in cross-validators. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14409) Investigate adding a RankingEvaluator to ML
[ https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14409: Assignee: Apache Spark > Investigate adding a RankingEvaluator to ML > --- > > Key: SPARK-14409 > URL: https://issues.apache.org/jira/browse/SPARK-14409 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Nick Pentreath >Assignee: Apache Spark >Priority: Minor > > {{mllib.evaluation}} contains a {{RankingMetrics}} class, while there is no > {{RankingEvaluator}} in {{ml.evaluation}}. Such an evaluator can be useful > for recommendation evaluation (and can be useful in other settings > potentially). > Should be thought about in conjunction with adding the "recommendAll" methods > in SPARK-13857, so that top-k ranking metrics can be used in cross-validators. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14695) Error occurs when using OFF_HEAP persistent level
Liang Lee created SPARK-14695: - Summary: Error occurs when using OFF_HEAP persistent level Key: SPARK-14695 URL: https://issues.apache.org/jira/browse/SPARK-14695 Project: Spark Issue Type: Bug Components: Block Manager, Spark Core Affects Versions: 1.6.0 Environment: Spark 1.6.0 Tachyon 0.8.2 Hadoop 2.6.0 Reporter: Liang Lee When running a PageRank job through the default examples, e.g., the class 'org.apache.spark.examples.graphx.Analytics' in the spark-examples-1.6.0-hadoop2.6.0.jar package, we got the following errors: 16/04/18 02:30:01 WARN scheduler.TaskSetManager: Lost task 9.0 in stage 6.0 (TID 53, R1S1): java.lang.IllegalArgumentException: requirement failed: sizeInBytes was negative: -1 at scala.Predef$.require(Predef.scala:233) at org.apache.spark.storage.BlockInfo.markReady(BlockInfo.scala:55) at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:822) at org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:645) at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:153) at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78) at org.apache.spark.rdd.RDD.iterator(RDD.scala:268) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:89) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.
java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) We use the following script to submit the job: /Hadoop/spark-1.6.0-bin-hadoop2.6/bin/spark-submit --class org.apache.spark.examples.graphx.Analytics /Hadoop/spark-1.6.0-bin-hadoop2.6/lib/spark-examples-1.6.0-hadoop2.6.0.jar pagerank /data/soc-LiveJournal1.txt --output=/output/live-off.res --numEPart=10 --numIter=1 --edgeStorageLevel=OFF_HEAP --vertexStorageLevel=OFF_HEAP When we set the storage level to MEMORY_ONLY or DISK_ONLY, there is no error and the job can finish correctly. But when we set the storage level to OFF_HEAP, which means using Tachyon for the storage process, the error occurs. The executor stack is like this; it seems the block write to Tachyon failed. 16/04/18 02:25:54 ERROR ExternalBlockStore: Error in putValues(rdd_20_1) java.io.IOException: Fail to cache: null at tachyon.client.file.FileOutStream.handleCacheWriteException(FileOutStream.java:276) at tachyon.client.file.FileOutStream.close(FileOutStream.java:165) at org.apache.spark.storage.TachyonBlockManager.putValues(TachyonBlockManager.scala:126) at org.apache.spark.storage.ExternalBlockStore.putIntoExternalBlockStore(ExternalBlockStore.scala:79) at org.apache.spark.storage.ExternalBlockStore.putIterator(ExternalBlockStore.scala:67) at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:798) at org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:645) at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:153) at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78) at org.apache.spark.rdd.RDD.iterator(RDD.scala:268) at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.
[jira] [Created] (SPARK-14694) Thrift Server + Hive Metastore + Kerberos doesn't work
zhangguancheng created SPARK-14694: -- Summary: Thrift Server + Hive Metastore + Kerberos doesn't work Key: SPARK-14694 URL: https://issues.apache.org/jira/browse/SPARK-14694 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.6.1, 1.6.0 Environment: Spark 1.6.1, compiled with Hadoop 2.6.0, YARN, Hive Hadoop 2.6.4 Hive 1.1.1 Kerberos Reporter: zhangguancheng My Hive Metastore is MySQL-based. I started a Spark Thrift Server on the same node as the Hive Metastore. I can open beeline and run SELECT statements, but for some commands like "show databases" I get an error: {quote} ERROR pool-24-thread-1 org.apache.thrift.transport.TSaslTransport:315 SASL negotiation failure javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)] at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211) at org.apache.thrift.transport.TSaslClientTransport.handleSaslStartMessage(TSaslClientTransport.java:94) at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:271) at org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37) at org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:52) at org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:49) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) at org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport.open(TUGIAssumingTransport.java:49) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:420) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:236) at org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.(SessionHiveMetaStoreClient.java:74) at 
sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:422) at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1521) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.(RetryingMetaStoreClient.java:86) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:132) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104) at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3005) at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3024) at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1234) at org.apache.hadoop.hive.ql.exec.DDLTask.showDatabases(DDLTask.java:2223) at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:385) at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:160) at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:88) at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1653) at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1412) at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1195) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1059) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1049) at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$runHive$1.apply(ClientWrapper.scala:495) at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$runHive$1.apply(ClientWrapper.scala:484) at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:290) at org.apache.spark.sql.hive.client.ClientWrapper.liftedTree1$1(ClientWrapper.scala:237) at 
org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:236) at org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:279) at org.apache.spark.sql.hive.client.ClientWrapper.runHive(ClientWrapper.scala:484) at org.apache.spark.sql.hive.client.ClientWrapper.runSqlHive(ClientWrapper.scala:474) at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:605) at org.apache.spark.sql.hive.execution.HiveNativeCommand.run(HiveNativeCommand.scala:33) at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58) at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56) at org.ap
[jira] [Updated] (SPARK-14693) Spark Streaming Context Hangs on Start
[ https://issues.apache.org/jira/browse/SPARK-14693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Evan Oman updated SPARK-14693: -- Description: All, I am trying to use Kinesis with Spark Streaming on Spark 1.6.0 via Databricks and my `ssc.start()` command is hanging. I am using the following function (based on [this guide|http://spark.apache.org/docs/latest/streaming-kinesis-integration.html], which, as an aside, contains some broken Github links) to make my Spark Streaming Context: {code:borderStyle=solid} def creatingFunc(sc: SparkContext): StreamingContext = { // Create a StreamingContext val ssc = new StreamingContext(sc, Seconds(batchIntervalSeconds)) // Creata a Kinesis stream val kinesisStream = KinesisUtils.createStream(ssc, kinesisAppName, kinesisStreamName, kinesisEndpointUrl, RegionUtils.getRegionByEndpoint(kinesisEndpointUrl).getName, InitialPositionInStream.LATEST, Seconds(kinesisCheckpointIntervalSeconds), StorageLevel.MEMORY_AND_DISK_SER_2, config.awsAccessKeyId, config.awsSecretKey) kinesisStream.print() ssc.remember(Minutes(1)) ssc.checkpoint(checkpointDir) ssc } {code} However when I run the following to start the streaming context: {code:borderStyle=solid} // Stop any existing StreamingContext val stopActiveContext = true if (stopActiveContext) { StreamingContext.getActive.foreach { _.stop(stopSparkContext = false) } } // Get or create a streaming context. val ssc = StreamingContext.getActiveOrCreate(() => main.creatingFunc(sc)) // This starts the streaming context in the background. ssc.start() {code} The last bit, `ssc.start()`, hangs indefinitely without issuing any log messages. I am running this on a freshly spun up cluster with no other notebooks attached so there aren't any other streaming contexts running. Any thoughts? 
Additionally, here are the libraries I am using (from my build.sbt file): {code:borderStyle=solid} "org.apache.spark" % "spark-core_2.10" % "1.6.0" "org.apache.spark" % "spark-sql_2.10" % "1.6.0" "org.apache.spark" % "spark-streaming-kinesis-asl_2.10" % "1.6.0" "org.apache.spark" % "spark-streaming_2.10" % "1.6.0" {code} was: All, I am trying to use Kinesis with Spark Streaming on Spark 1.6.0 via Databricks and my `ssc.start()` command is hanging. I am using the following function (based on http://spark.apache.org/docs/latest/streaming-kinesis-integration.html, which, as an aside, contains some broken Github links) to make my Spark Streaming Context: {code:borderStyle=solid} def creatingFunc(sc: SparkContext): StreamingContext = { // Create a StreamingContext val ssc = new StreamingContext(sc, Seconds(batchIntervalSeconds)) // Creata a Kinesis stream val kinesisStream = KinesisUtils.createStream(ssc, kinesisAppName, kinesisStreamName, kinesisEndpointUrl, RegionUtils.getRegionByEndpoint(kinesisEndpointUrl).getName, InitialPositionInStream.LATEST, Seconds(kinesisCheckpointIntervalSeconds), StorageLevel.MEMORY_AND_DISK_SER_2, config.awsAccessKeyId, config.awsSecretKey) kinesisStream.print() ssc.remember(Minutes(1)) ssc.checkpoint(checkpointDir) ssc } {code} However when I run the following to start the streaming context: {code:borderStyle=solid} // Stop any existing StreamingContext val stopActiveContext = true if (stopActiveContext) { StreamingContext.getActive.foreach { _.stop(stopSparkContext = false) } } // Get or create a streaming context. val ssc = StreamingContext.getActiveOrCreate(() => main.creatingFunc(sc)) // This starts the streaming context in the background. ssc.start() {code} The last bit, `ssc.start()`, hangs indefinitely without issuing any log messages. I am running this on a freshly spun up cluster with no other notebooks attached so there aren't any other streaming contexts running. Any thoughts? 
Additionally, here are the libraries I am using (from my build.sbt file): {code:borderStyle=solid} "org.apache.spark" % "spark-core_2.10" % "1.6.0" "org.apache.spark" % "spark-sql_2.10" % "1.6.0" "org.apache.spark" % "spark-streaming-kinesis-asl_2.10" % "1.6.0" "org.apache.spark" % "spark-streaming_2.10" % "1.6.0" {code} > Spark Streaming Context Hangs on Start > -- > > Key: SPARK-14693 > URL: https://issues.apache.org/jira/browse/SPARK-14693 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.6.0, 1.6.1 > Environment: Databricks Cl
[jira] [Updated] (SPARK-14693) Spark Streaming Context Hangs on Start
[ https://issues.apache.org/jira/browse/SPARK-14693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Evan Oman updated SPARK-14693: -- Description: All, I am trying to use Kinesis with Spark Streaming on Spark 1.6.0 via Databricks and my `ssc.start()` command is hanging. I am using the following function (based on http://spark.apache.org/docs/latest/streaming-kinesis-integration.html, which, as an aside, contains some broken Github links) to make my Spark Streaming Context: {code:borderStyle=solid} def creatingFunc(sc: SparkContext): StreamingContext = { // Create a StreamingContext val ssc = new StreamingContext(sc, Seconds(batchIntervalSeconds)) // Creata a Kinesis stream val kinesisStream = KinesisUtils.createStream(ssc, kinesisAppName, kinesisStreamName, kinesisEndpointUrl, RegionUtils.getRegionByEndpoint(kinesisEndpointUrl).getName, InitialPositionInStream.LATEST, Seconds(kinesisCheckpointIntervalSeconds), StorageLevel.MEMORY_AND_DISK_SER_2, config.awsAccessKeyId, config.awsSecretKey) kinesisStream.print() ssc.remember(Minutes(1)) ssc.checkpoint(checkpointDir) ssc } {code} However when I run the following to start the streaming context: {code:borderStyle=solid} // Stop any existing StreamingContext val stopActiveContext = true if (stopActiveContext) { StreamingContext.getActive.foreach { _.stop(stopSparkContext = false) } } // Get or create a streaming context. val ssc = StreamingContext.getActiveOrCreate(() => main.creatingFunc(sc)) // This starts the streaming context in the background. ssc.start() {code} The last bit, `ssc.start()`, hangs indefinitely without issuing any log messages. I am running this on a freshly spun up cluster with no other notebooks attached so there aren't any other streaming contexts running. Any thoughts? 
Additionally, here are the libraries I am using (from my build.sbt file): {code:borderStyle=solid} "org.apache.spark" % "spark-core_2.10" % "1.6.0" "org.apache.spark" % "spark-sql_2.10" % "1.6.0" "org.apache.spark" % "spark-streaming-kinesis-asl_2.10" % "1.6.0" "org.apache.spark" % "spark-streaming_2.10" % "1.6.0" {code} was: All, I am trying to use Kinesis with Spark Streaming on Spark 1.6.0 via Databricks and my `ssc.start()` command is hanging. I am using the following function to make my Spark Streaming Context: {code:borderStyle=solid} def creatingFunc(sc: SparkContext): StreamingContext = { // Create a StreamingContext val ssc = new StreamingContext(sc, Seconds(batchIntervalSeconds)) // Creata a Kinesis stream val kinesisStream = KinesisUtils.createStream(ssc, kinesisAppName, kinesisStreamName, kinesisEndpointUrl, RegionUtils.getRegionByEndpoint(kinesisEndpointUrl).getName, InitialPositionInStream.LATEST, Seconds(kinesisCheckpointIntervalSeconds), StorageLevel.MEMORY_AND_DISK_SER_2, config.awsAccessKeyId, config.awsSecretKey) kinesisStream.print() ssc.remember(Minutes(1)) ssc.checkpoint(checkpointDir) ssc } {code} However when I run the following to start the streaming context: {code:borderStyle=solid} // Stop any existing StreamingContext val stopActiveContext = true if (stopActiveContext) { StreamingContext.getActive.foreach { _.stop(stopSparkContext = false) } } // Get or create a streaming context. val ssc = StreamingContext.getActiveOrCreate(() => main.creatingFunc(sc)) // This starts the streaming context in the background. ssc.start() {code} The last bit, `ssc.start()`, hangs indefinitely without issuing any log messages. I am running this on a freshly spun up cluster with no other notebooks attached so there aren't any other streaming contexts running. Any thoughts? 
Additionally, here are the libraries I am using (from my build.sbt file): {code:borderStyle=solid} "org.apache.spark" % "spark-core_2.10" % "1.6.0" "org.apache.spark" % "spark-sql_2.10" % "1.6.0" "org.apache.spark" % "spark-streaming-kinesis-asl_2.10" % "1.6.0" "org.apache.spark" % "spark-streaming_2.10" % "1.6.0" {code} > Spark Streaming Context Hangs on Start > -- > > Key: SPARK-14693 > URL: https://issues.apache.org/jira/browse/SPARK-14693 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.6.0, 1.6.1 > Environment: Databricks Cloud >Reporter: Evan Oman > > All, > I am trying to use Kinesis with Spark Streaming on Spark 1.6.0 via Databricks > and my `ssc.start()` com
[jira] [Updated] (SPARK-14693) Spark Streaming Context Hangs on Start
[ https://issues.apache.org/jira/browse/SPARK-14693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Evan Oman updated SPARK-14693: -- Description: All, I am trying to use Kinesis with Spark Streaming on Spark 1.6.0 via Databricks and my `ssc.start()` command is hanging. I am using the following function to make my Spark Streaming Context: {code:borderStyle=solid} def creatingFunc(sc: SparkContext): StreamingContext = { // Create a StreamingContext val ssc = new StreamingContext(sc, Seconds(batchIntervalSeconds)) // Creata a Kinesis stream val kinesisStream = KinesisUtils.createStream(ssc, kinesisAppName, kinesisStreamName, kinesisEndpointUrl, RegionUtils.getRegionByEndpoint(kinesisEndpointUrl).getName, InitialPositionInStream.LATEST, Seconds(kinesisCheckpointIntervalSeconds), StorageLevel.MEMORY_AND_DISK_SER_2, config.awsAccessKeyId, config.awsSecretKey) kinesisStream.print() ssc.remember(Minutes(1)) ssc.checkpoint(checkpointDir) ssc } {code} However when I run the following to start the streaming context: {code:borderStyle=solid} // Stop any existing StreamingContext val stopActiveContext = true if (stopActiveContext) { StreamingContext.getActive.foreach { _.stop(stopSparkContext = false) } } // Get or create a streaming context. val ssc = StreamingContext.getActiveOrCreate(() => main.creatingFunc(sc)) // This starts the streaming context in the background. ssc.start() {code} The last bit, `ssc.start()`, hangs indefinitely without issuing any log messages. I am running this on a freshly spun up cluster with no other notebooks attached so there aren't any other streaming contexts running. Any thoughts? 
Additionally, here are the libraries I am using (from my build.sbt file): {code:borderStyle=solid} "org.apache.spark" % "spark-core_2.10" % "1.6.0" "org.apache.spark" % "spark-sql_2.10" % "1.6.0" "org.apache.spark" % "spark-streaming-kinesis-asl_2.10" % "1.6.0" "org.apache.spark" % "spark-streaming_2.10" % "1.6.0" {code} was: All, I am trying to use Kinesis with Spark Streaming on Spark 1.6.0 via Databricks and my `ssc.start()` command is hanging. I am using the following function to make my Spark Streaming Context: {code:scala|borderStyle=solid} def creatingFunc(sc: SparkContext): StreamingContext = { // Create a StreamingContext val ssc = new StreamingContext(sc, Seconds(batchIntervalSeconds)) // Creata a Kinesis stream val kinesisStream = KinesisUtils.createStream(ssc, kinesisAppName, kinesisStreamName, kinesisEndpointUrl, RegionUtils.getRegionByEndpoint(kinesisEndpointUrl).getName, InitialPositionInStream.LATEST, Seconds(kinesisCheckpointIntervalSeconds), StorageLevel.MEMORY_AND_DISK_SER_2, config.awsAccessKeyId, config.awsSecretKey) kinesisStream.print() ssc.remember(Minutes(1)) ssc.checkpoint(checkpointDir) ssc } {code} However when I run the following to start the streaming context: {code:scala|borderStyle=solid} // Stop any existing StreamingContext val stopActiveContext = true if (stopActiveContext) { StreamingContext.getActive.foreach { _.stop(stopSparkContext = false) } } // Get or create a streaming context. val ssc = StreamingContext.getActiveOrCreate(() => main.creatingFunc(sc)) // This starts the streaming context in the background. ssc.start() {code} The last bit, `ssc.start()`, hangs indefinitely without issuing any log messages. I am running this on a freshly spun up cluster with no other notebooks attached so there aren't any other streaming contexts running. Any thoughts? 
Additionally, here are the libraries I am using (from my build.sbt file): {code:borderStyle=solid} "org.apache.spark" % "spark-core_2.10" % "1.6.0" "org.apache.spark" % "spark-sql_2.10" % "1.6.0" "org.apache.spark" % "spark-streaming-kinesis-asl_2.10" % "1.6.0" "org.apache.spark" % "spark-streaming_2.10" % "1.6.0" {code} > Spark Streaming Context Hangs on Start > -- > > Key: SPARK-14693 > URL: https://issues.apache.org/jira/browse/SPARK-14693 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.6.0, 1.6.1 > Environment: Databricks Cloud >Reporter: Evan Oman > > All, > I am trying to use Kinesis with Spark Streaming on Spark 1.6.0 via Databricks > and my `ssc.start()` command is hanging. > I am using the following function to make my Spark Streaming Context: > {code:borderStyle=solid} > def creat
[jira] [Assigned] (SPARK-14127) [Table related commands] Describe table
[ https://issues.apache.org/jira/browse/SPARK-14127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14127: Assignee: Apache Spark > [Table related commands] Describe table > --- > > Key: SPARK-14127 > URL: https://issues.apache.org/jira/browse/SPARK-14127 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Apache Spark > > TOK_DESCTABLE > Describe a column/table/partition (see here and here). Seems we support > DESCRIBE and DESCRIBE EXTENDED. It will be good to also support other > syntaxes (and check if we are missing anything). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14693) Spark Streaming Context Hangs on Start
[ https://issues.apache.org/jira/browse/SPARK-14693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Evan Oman updated SPARK-14693: -- Description: All, I am trying to use Kinesis with Spark Streaming on Spark 1.6.0 via Databricks and my `ssc.start()` command is hanging. I am using the following function to make my Spark Streaming Context: {code:scala|borderStyle=solid} def creatingFunc(sc: SparkContext): StreamingContext = { // Create a StreamingContext val ssc = new StreamingContext(sc, Seconds(batchIntervalSeconds)) // Creata a Kinesis stream val kinesisStream = KinesisUtils.createStream(ssc, kinesisAppName, kinesisStreamName, kinesisEndpointUrl, RegionUtils.getRegionByEndpoint(kinesisEndpointUrl).getName, InitialPositionInStream.LATEST, Seconds(kinesisCheckpointIntervalSeconds), StorageLevel.MEMORY_AND_DISK_SER_2, config.awsAccessKeyId, config.awsSecretKey) kinesisStream.print() ssc.remember(Minutes(1)) ssc.checkpoint(checkpointDir) ssc } {code} However when I run the following to start the streaming context: {code:scala|borderStyle=solid} // Stop any existing StreamingContext val stopActiveContext = true if (stopActiveContext) { StreamingContext.getActive.foreach { _.stop(stopSparkContext = false) } } // Get or create a streaming context. val ssc = StreamingContext.getActiveOrCreate(() => main.creatingFunc(sc)) // This starts the streaming context in the background. ssc.start() {code} The last bit, `ssc.start()`, hangs indefinitely without issuing any log messages. I am running this on a freshly spun up cluster with no other notebooks attached so there aren't any other streaming contexts running. Any thoughts? 
Additionally, here are the libraries I am using (from my build.sbt file): {code:borderStyle=solid} "org.apache.spark" % "spark-core_2.10" % "1.6.0" "org.apache.spark" % "spark-sql_2.10" % "1.6.0" "org.apache.spark" % "spark-streaming-kinesis-asl_2.10" % "1.6.0" "org.apache.spark" % "spark-streaming_2.10" % "1.6.0" {code} was: All, I am trying to use Kinesis with Spark Streaming on Spark 1.6.0 via Databricks and my `ssc.start()` command is hanging. I am using the following function to make my Spark Streaming Context: {code:borderStyle=solid} def creatingFunc(sc: SparkContext): StreamingContext = { // Create a StreamingContext val ssc = new StreamingContext(sc, Seconds(batchIntervalSeconds)) // Creata a Kinesis stream val kinesisStream = KinesisUtils.createStream(ssc, kinesisAppName, kinesisStreamName, kinesisEndpointUrl, RegionUtils.getRegionByEndpoint(kinesisEndpointUrl).getName, InitialPositionInStream.LATEST, Seconds(kinesisCheckpointIntervalSeconds), StorageLevel.MEMORY_AND_DISK_SER_2, config.awsAccessKeyId, config.awsSecretKey) kinesisStream.print() ssc.remember(Minutes(1)) ssc.checkpoint(checkpointDir) ssc } {code} However when I run the following to start the streaming context: {code:borderStyle=solid} // Stop any existing StreamingContext val stopActiveContext = true if (stopActiveContext) { StreamingContext.getActive.foreach { _.stop(stopSparkContext = false) } } // Get or create a streaming context. val ssc = StreamingContext.getActiveOrCreate(() => main.creatingFunc(sc)) // This starts the streaming context in the background. ssc.start() {code} The last bit, `ssc.start()`, hangs indefinitely without issuing any log messages. I am running this on a freshly spun up cluster with no other notebooks attached so there aren't any other streaming contexts running. Any thoughts? 
Additionally, here are the libraries I am using (from my build.sbt file): {code:borderStyle=solid} "org.apache.spark" % "spark-core_2.10" % "1.6.0" "org.apache.spark" % "spark-sql_2.10" % "1.6.0" "org.apache.spark" % "spark-streaming-kinesis-asl_2.10" % "1.6.0" "org.apache.spark" % "spark-streaming_2.10" % "1.6.0" {code} > Spark Streaming Context Hangs on Start > -- > > Key: SPARK-14693 > URL: https://issues.apache.org/jira/browse/SPARK-14693 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.6.0, 1.6.1 > Environment: Databricks Cloud >Reporter: Evan Oman > > All, > I am trying to use Kinesis with Spark Streaming on Spark 1.6.0 via Databricks > and my `ssc.start()` command is hanging. > I am using the following function to make my Spark Streaming Context: > {code:scala|borderStyle=solid} > def
[jira] [Assigned] (SPARK-14127) [Table related commands] Describe table
[ https://issues.apache.org/jira/browse/SPARK-14127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14127: Assignee: (was: Apache Spark) > [Table related commands] Describe table > --- > > Key: SPARK-14127 > URL: https://issues.apache.org/jira/browse/SPARK-14127 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai > > TOK_DESCTABLE > Describe a column/table/partition (see here and here). Seems we support > DESCRIBE and DESCRIBE EXTENDED. It will be good to also support other > syntaxes (and check if we are missing anything). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14127) [Table related commands] Describe table
[ https://issues.apache.org/jira/browse/SPARK-14127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245069#comment-15245069 ] Apache Spark commented on SPARK-14127: -- User 'dilipbiswal' has created a pull request for this issue: https://github.com/apache/spark/pull/12460 > [Table related commands] Describe table > --- > > Key: SPARK-14127 > URL: https://issues.apache.org/jira/browse/SPARK-14127 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai > > TOK_DESCTABLE > Describe a column/table/partition (see here and here). Seems we support > DESCRIBE and DESCRIBE EXTENDED. It will be good to also support other > syntaxes (and check if we are missing anything). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14693) Spark Streaming Context Hangs on Start
Evan Oman created SPARK-14693: - Summary: Spark Streaming Context Hangs on Start Key: SPARK-14693 URL: https://issues.apache.org/jira/browse/SPARK-14693 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.6.1, 1.6.0 Environment: Databricks Cloud Reporter: Evan Oman All, I am trying to use Kinesis with Spark Streaming on Spark 1.6.0 via Databricks and my `ssc.start()` command is hanging. I am using the following function to make my Spark Streaming Context: {code:borderStyle=solid} def creatingFunc(sc: SparkContext): StreamingContext = { // Create a StreamingContext val ssc = new StreamingContext(sc, Seconds(batchIntervalSeconds)) // Create a Kinesis stream val kinesisStream = KinesisUtils.createStream(ssc, kinesisAppName, kinesisStreamName, kinesisEndpointUrl, RegionUtils.getRegionByEndpoint(kinesisEndpointUrl).getName, InitialPositionInStream.LATEST, Seconds(kinesisCheckpointIntervalSeconds), StorageLevel.MEMORY_AND_DISK_SER_2, config.awsAccessKeyId, config.awsSecretKey) kinesisStream.print() ssc.remember(Minutes(1)) ssc.checkpoint(checkpointDir) ssc } {code} However when I run the following to start the streaming context: {code:borderStyle=solid} // Stop any existing StreamingContext val stopActiveContext = true if (stopActiveContext) { StreamingContext.getActive.foreach { _.stop(stopSparkContext = false) } } // Get or create a streaming context. val ssc = StreamingContext.getActiveOrCreate(() => main.creatingFunc(sc)) // This starts the streaming context in the background. ssc.start() {code} The last bit, `ssc.start()`, hangs indefinitely without issuing any log messages. I am running this on a freshly spun up cluster with no other notebooks attached so there aren't any other streaming contexts running. Any thoughts? 
Additionally, here are the libraries I am using (from my build.sbt file): {code:borderStyle=solid} "org.apache.spark" % "spark-core_2.10" % "1.6.0" "org.apache.spark" % "spark-sql_2.10" % "1.6.0" "org.apache.spark" % "spark-streaming-kinesis-asl_2.10" % "1.6.0" "org.apache.spark" % "spark-streaming_2.10" % "1.6.0" {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14453) Consider removing SPARK_JAVA_OPTS env variable
[ https://issues.apache.org/jira/browse/SPARK-14453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245065#comment-15245065 ] Krishnan Narayan commented on SPARK-14453: -- +1 > Consider removing SPARK_JAVA_OPTS env variable > -- > > Key: SPARK-14453 > URL: https://issues.apache.org/jira/browse/SPARK-14453 > Project: Spark > Issue Type: Bug > Components: Spark Core, YARN >Reporter: Saisai Shao >Priority: Minor > > SPARK_JAVA_OPTS was deprecated since 1.0, with the release of major version > (2.0), I think it would be better to remove the support of this env variable. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14453) Consider removing SPARK_JAVA_OPTS env variable
[ https://issues.apache.org/jira/browse/SPARK-14453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245048#comment-15245048 ] Saisai Shao commented on SPARK-14453: - Yes, this should be a part of SPARK-12344. Since SPARK-12344 is itself already a subtask, I cannot make this a subtask of it. > Consider removing SPARK_JAVA_OPTS env variable > -- > > Key: SPARK-14453 > URL: https://issues.apache.org/jira/browse/SPARK-14453 > Project: Spark > Issue Type: Bug > Components: Spark Core, YARN >Reporter: Saisai Shao >Priority: Minor > > SPARK_JAVA_OPTS was deprecated since 1.0, with the release of major version > (2.0), I think it would be better to remove the support of this env variable. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14692) Error While Setting the path for R front end
Niranjan Molkeri` created SPARK-14692: - Summary: Error While Setting the path for R front end Key: SPARK-14692 URL: https://issues.apache.org/jira/browse/SPARK-14692 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 1.6.1 Environment: Mac OSX Reporter: Niranjan Molkeri` Trying to set the environment path for SparkR in RStudio. Getting this bug. > .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths())) > library(SparkR) Error in library(SparkR) : there is no package called ‘SparkR’ > sc <- sparkR.init(master="local") Error: could not find function "sparkR.init" In the directory it points to, there is a directory called SparkR. I don't know how to proceed with this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13662) [SQL][Hive] Have SHOW TABLES return additional fields from Hive MetaStore
[ https://issues.apache.org/jira/browse/SPARK-13662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245029#comment-15245029 ] Vijay Parmar commented on SPARK-13662: -- Hi Evan, I would like to look into the issue. Let me know if I can go ahead on this? Thanks Vijay > [SQL][Hive] Have SHOW TABLES return additional fields from Hive MetaStore > -- > > Key: SPARK-13662 > URL: https://issues.apache.org/jira/browse/SPARK-13662 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.5.2, 1.6.0 > Environment: All >Reporter: Evan Chan > > Currently, the SHOW TABLES command in Spark's Hive ThriftServer, or > equivalently the HiveContext.tables method, returns a DataFrame with only two > columns: the name of the table and whether it is temporary. It would be > really nice to add support to return some extra information, such as: > - Whether this table is Spark-only or a native Hive table > - If spark-only, the name of the data source > - potentially other properties > The first two is really useful for BI environments connecting to multiple > data sources and that work with both Hive and Spark. > Some thoughts: > - The SQL/HiveContext Catalog API might need to be expanded to return > something like a TableEntry, rather than just a tuple of (name, temporary). > - I believe there is a Hive Catalog/client API to get information about each > table. I suppose one concern would be the speed of using this API. Perhaps > there are other APis that can get this info faster. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14691) Simplify and Unify Error Generation for Unsupported Alter Table DDL
[ https://issues.apache.org/jira/browse/SPARK-14691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14691: Assignee: (was: Apache Spark) > Simplify and Unify Error Generation for Unsupported Alter Table DDL > --- > > Key: SPARK-14691 > URL: https://issues.apache.org/jira/browse/SPARK-14691 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > So far, we are capturing each unsupported Alter Table in separate visit > functions. They should be unified and issue a ParseException instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14691) Simplify and Unify Error Generation for Unsupported Alter Table DDL
[ https://issues.apache.org/jira/browse/SPARK-14691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14691: Assignee: Apache Spark > Simplify and Unify Error Generation for Unsupported Alter Table DDL > --- > > Key: SPARK-14691 > URL: https://issues.apache.org/jira/browse/SPARK-14691 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li >Assignee: Apache Spark > > So far, we are capturing each unsupported Alter Table in separate visit > functions. They should be unified and issue a ParseException instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14691) Simplify and Unify Error Generation for Unsupported Alter Table DDL
[ https://issues.apache.org/jira/browse/SPARK-14691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245028#comment-15245028 ] Apache Spark commented on SPARK-14691: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/12459 > Simplify and Unify Error Generation for Unsupported Alter Table DDL > --- > > Key: SPARK-14691 > URL: https://issues.apache.org/jira/browse/SPARK-14691 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > So far, we are capturing each unsupported Alter Table in separate visit > functions. They should be unified and issue a ParseException instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14691) Simplify and Unify Error Generation for Unsupported Alter Table DDL
Xiao Li created SPARK-14691: --- Summary: Simplify and Unify Error Generation for Unsupported Alter Table DDL Key: SPARK-14691 URL: https://issues.apache.org/jira/browse/SPARK-14691 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Xiao Li So far, we are capturing each unsupported Alter Table in separate visit functions. They should be unified and issue a ParseException instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14325) some strange name conflicts in `group_by`
[ https://issues.apache.org/jira/browse/SPARK-14325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245024#comment-15245024 ] Vijay Parmar commented on SPARK-14325: -- Hi Dmitriy, It has been a very short time since I joined the community. I would like to look into the issue if no other member has taken it up. Please let me know if I can go ahead. Thanks Vijay > some strange name conflicts in `group_by` > - > > Key: SPARK-14325 > URL: https://issues.apache.org/jira/browse/SPARK-14325 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.6.0, 1.6.1 > Environment: sparkR 1.6.0 >Reporter: Dmitriy Selivanov > > group_by shows strange behaviour when trying to aggregate by a column named "x". > consider the following example > {code} > df > # DataFrame[userId:bigint, type:string, j:int, x:int] > df %>%group_by(df$userId, df$type, df$j) %>% agg(x = "sum") > #Error in (function (classes, fdef, mtable) : > # unable to find an inherited method for function ‘agg’ for signature > ‘"character"’ > {code} > after renaming x -> x2 it works just fine. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14647) Group SQLContext/HiveContext state into PersistentState
[ https://issues.apache.org/jira/browse/SPARK-14647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245009#comment-15245009 ] Andrew Or commented on SPARK-14647: --- I've reverted it for now. > Group SQLContext/HiveContext state into PersistentState > --- > > Key: SPARK-14647 > URL: https://issues.apache.org/jira/browse/SPARK-14647 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Andrew Or > Fix For: 2.0.0 > > > This is analogous to SPARK-13526, which moved some things into > `SessionState`. After this issue we'll have an analogous `PersistentState` > that groups things to be shared across sessions. This will simplify the > constructors of the contexts significantly by allowing us to pass fewer > things into the contexts. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14586) SparkSQL doesn't parse decimal like Hive
[ https://issues.apache.org/jira/browse/SPARK-14586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15244986#comment-15244986 ] Stephane Maarek commented on SPARK-14586: - Hi [~tsuresh], thanks for your reply. It makes sense! I'm using Hive 1.2.1. My only concern is that looking at the code, I understand why the number wouldn't be parsed correctly in Spark and Hive, but I don't understand why Hive 1.2.1 CLI would parse the number correctly (as seen in my troubleshooting)? Isn't Spark using the exact same logic as Hive? > SparkSQL doesn't parse decimal like Hive > > > Key: SPARK-14586 > URL: https://issues.apache.org/jira/browse/SPARK-14586 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 >Reporter: Stephane Maarek > > create a test_data.csv with the following > {code:none} > a, 2.0 > ,3.0 > {code} > (the space is intended before the 2) > copy the test_data.csv to hdfs:///spark_testing_2 > go in hive, run the following statements > {code:sql} > CREATE SCHEMA IF NOT EXISTS spark_testing; > DROP TABLE IF EXISTS spark_testing.test_csv_2; > CREATE EXTERNAL TABLE `spark_testing.test_csv_2`( > column_1 varchar(10), > column_2 decimal(4,2)) > ROW FORMAT DELIMITED > FIELDS TERMINATED BY ',' > STORED AS TEXTFILE LOCATION '/spark_testing_2' > TBLPROPERTIES('serialization.null.format'=''); > select * from spark_testing.test_csv_2; > OK > a 2 > NULL3 > {code} > As you can see, the value " 2" gets parsed correctly to 2 > Now onto Spark-shell: > {code:java} > val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) > sqlContext.sql("select * from spark_testing.test_csv_2").show() > +++ > |column_1|column_2| > +++ > | a|null| > |null|3.00| > +++ > {code} > As you can see, the " 2" got parsed to null. Therefore Hive and Spark don't > have a similar parsing behavior for decimals. I wouldn't say it is a bug per > se, but it looks like a necessary improvement for the two engines to > converge. 
Hive version is 1.5.1 > Not sure if relevant, but Scala does parse numbers with leading space > correctly > {code} > scala> "2.0".toDouble > res21: Double = 2.0 > scala> " 2.0".toDouble > res22: Double = 2.0 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
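[Editor's note] The whitespace sensitivity the reporter observes can be reproduced with plain JVM number parsing, which may be where the two code paths diverge (an assumption, not traced through the Spark/Hive source here): `java.lang.Double.parseDouble` ignores leading and trailing whitespace, while `java.math.BigDecimal`'s string constructor rejects it outright.

```scala
// Sketch of the JVM-level parsing difference. Assumption: a decimal column
// goes through a strict BigDecimal-style parse, while double parsing (and
// Scala's .toDouble, which delegates to it) tolerates surrounding spaces.
val viaDouble = java.lang.Double.parseDouble(" 2.0") // whitespace ignored

val viaDecimal =
  try { Some(new java.math.BigDecimal(" 2.0")) }
  catch { case _: NumberFormatException => None } // strict: " 2.0" rejected

println(viaDouble)  // 2.0
println(viaDecimal) // None
```

This would explain why `" 2.0".toDouble` succeeds in the reporter's Scala check while the decimal(4,2) column comes back null.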
[jira] [Commented] (SPARK-14642) import org.apache.spark.sql.expressions._ breaks udf under functions
[ https://issues.apache.org/jira/browse/SPARK-14642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15244984#comment-15244984 ] Apache Spark commented on SPARK-14642: -- User 'sbcd90' has created a pull request for this issue: https://github.com/apache/spark/pull/12458 > import org.apache.spark.sql.expressions._ breaks udf under functions > > > Key: SPARK-14642 > URL: https://issues.apache.org/jira/browse/SPARK-14642 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Priority: Critical > > The following code works > {code} > scala> import org.apache.spark.sql.functions._ > import org.apache.spark.sql.functions._ > scala> udf((v: String) => v.stripSuffix("-abc")) > res0: org.apache.spark.sql.expressions.UserDefinedFunction = > UserDefinedFunction(,StringType,Some(List(StringType))) > {code} > But, the following does not > {code} > scala> import org.apache.spark.sql.functions._ > import org.apache.spark.sql.functions._ > scala> import org.apache.spark.sql.expressions._ > import org.apache.spark.sql.expressions._ > scala> udf((v: String) => v.stripSuffix("-abc")) > :30: error: No TypeTag available for String >udf((v: String) => v.stripSuffix("-abc")) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14642) import org.apache.spark.sql.expressions._ breaks udf under functions
[ https://issues.apache.org/jira/browse/SPARK-14642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14642: Assignee: Apache Spark > import org.apache.spark.sql.expressions._ breaks udf under functions > > > Key: SPARK-14642 > URL: https://issues.apache.org/jira/browse/SPARK-14642 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Assignee: Apache Spark >Priority: Critical > > The following code works > {code} > scala> import org.apache.spark.sql.functions._ > import org.apache.spark.sql.functions._ > scala> udf((v: String) => v.stripSuffix("-abc")) > res0: org.apache.spark.sql.expressions.UserDefinedFunction = > UserDefinedFunction(,StringType,Some(List(StringType))) > {code} > But, the following does not > {code} > scala> import org.apache.spark.sql.functions._ > import org.apache.spark.sql.functions._ > scala> import org.apache.spark.sql.expressions._ > import org.apache.spark.sql.expressions._ > scala> udf((v: String) => v.stripSuffix("-abc")) > :30: error: No TypeTag available for String >udf((v: String) => v.stripSuffix("-abc")) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14642) import org.apache.spark.sql.expressions._ breaks udf under functions
[ https://issues.apache.org/jira/browse/SPARK-14642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14642: Assignee: (was: Apache Spark) > import org.apache.spark.sql.expressions._ breaks udf under functions > > > Key: SPARK-14642 > URL: https://issues.apache.org/jira/browse/SPARK-14642 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Priority: Critical > > The following code works > {code} > scala> import org.apache.spark.sql.functions._ > import org.apache.spark.sql.functions._ > scala> udf((v: String) => v.stripSuffix("-abc")) > res0: org.apache.spark.sql.expressions.UserDefinedFunction = > UserDefinedFunction(,StringType,Some(List(StringType))) > {code} > But, the following does not > {code} > scala> import org.apache.spark.sql.functions._ > import org.apache.spark.sql.functions._ > scala> import org.apache.spark.sql.expressions._ > import org.apache.spark.sql.expressions._ > scala> udf((v: String) => v.stripSuffix("-abc")) > :30: error: No TypeTag available for String >udf((v: String) => v.stripSuffix("-abc")) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14632) randomSplit method fails on dataframes with maps in schema
[ https://issues.apache.org/jira/browse/SPARK-14632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-14632. - Resolution: Fixed Assignee: Subhobrata Dey Fix Version/s: 2.0.0 > randomSplit method fails on dataframes with maps in schema > -- > > Key: SPARK-14632 > URL: https://issues.apache.org/jira/browse/SPARK-14632 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1 >Reporter: Stefano Costantini >Assignee: Subhobrata Dey > Fix For: 2.0.0 > > > Applying the randomSplit method to a dataframe with at least one map in the > schema results in an exception > {noformat} > org.apache.spark.sql.AnalysisException: cannot resolve 'features ASC' due to > data type mismatch: cannot sort data type map; > {noformat} > This bug can be reproduced as follows: > {code} > val sqlContext = new org.apache.spark.sql.SQLContext(sc) > import sqlContext.implicits._ > import org.apache.spark.sql._ > import org.apache.spark.sql.types._ > val arr = Array(("user1", Map("f1" -> 1.0, "f2" -> 1.0)), ("user2", Map("f2" > -> 1.0, "f3" -> 1.0)), ("user3",Map("f1" -> 1.0, "f2" -> 1.0))) > val df = sc.parallelize(arr).toDF("user","features") > df.printSchema > val Array(split1, split2) = df.randomSplit(Array(0.7, 0.3), seed = 101L) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
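[Editor's note] The "cannot sort data type map" message suggests the implementation sorts rows to make per-split sampling deterministic, which fails on unsortable map columns — an inference from the error, not confirmed from the source here. Conceptually, `randomSplit` only needs one seeded uniform draw per row against the cumulative normalized weights; a hypothetical local analogue in plain Scala (all names illustrative):

```scala
import scala.util.Random

// Hypothetical local analogue of DataFrame.randomSplit: each row gets one
// uniform draw, and lands in the first split whose cumulative normalized
// weight exceeds the draw. Splits are disjoint and exhaustive.
def localRandomSplit[T](rows: Seq[T], weights: Seq[Double], seed: Long): Seq[Seq[T]] = {
  val total = weights.sum
  val cumulative = weights.map(_ / total).scanLeft(0.0)(_ + _).tail
  val rng = new Random(seed)
  val tagged = rows.map { row =>
    val draw = rng.nextDouble()
    val idx = cumulative.indexWhere(draw < _)
    // guard against floating-point round-off in the last cumulative bound
    (row, if (idx >= 0) idx else weights.length - 1)
  }
  weights.indices.map(i => tagged.collect { case (row, b) if b == i => row })
}

val Seq(train, test) = localRandomSplit((1 to 100).toSeq, Seq(0.7, 0.3), seed = 101L)
println(train.size + test.size) // 100: every row lands in exactly one split
```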
[jira] [Commented] (SPARK-12922) Implement gapply() on DataFrame in SparkR
[ https://issues.apache.org/jira/browse/SPARK-12922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15244918#comment-15244918 ] Narine Kokhlikyan commented on SPARK-12922: --- Hi [~sunrui], I’ve made some progress in putting logical and physical plans together and calling R workers, however I still have some questions. 1. I’m still not quite sure about the number of partitions. As you wrote in https://issues.apache.org/jira/browse/SPARK-6817 we need to tune the number of partitions based on “spark.sql.shuffle.partitions”. What do you exactly mean by tuning? Repartitioning ? 2. I have another question about grouping by keys: groupByKey with one key is fine, however if we have more than one key we probably need to introduce a case class. With a case class it looks okay too, but I’m not sure how convenient it is. Any ideas ? case class KeyData(a: Int, b: Int) val gd1 = df.groupByKey(r=>KeyData(r.getInt(0), r.getInt(1))) Thanks, Narine > Implement gapply() on DataFrame in SparkR > - > > Key: SPARK-12922 > URL: https://issues.apache.org/jira/browse/SPARK-12922 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 1.6.0 >Reporter: Sun Rui > > gapply() applies an R function on groups grouped by one or more columns of a > DataFrame, and returns a DataFrame. It is like GroupedDataSet.flatMapGroups() > in the Dataset API. > Two API styles are supported: > 1. > {code} > gd <- groupBy(df, col1, ...) > gapply(gd, function(grouping_key, group) {}, schema) > {code} > 2. > {code} > gapply(df, grouping_columns, function(grouping_key, group) {}, schema) > {code} > R function input: grouping keys value, a local data.frame of this grouped > data > R function output: local data.frame > Schema specifies the Row format of the output of the R function. It must > match the R function's output. > Note that map-side combination (partial aggregation) is not supported, user > could do map-side combination via dapply(). 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
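[Editor's note] Narine's second question — composite grouping keys via a case class — can be sanity-checked with plain Scala collections: case classes get structural `equals`/`hashCode` generated for them, which is exactly what makes two instances with the same field values collapse into one grouping key (names below are illustrative, not from the gapply() implementation):

```scala
// Case classes generate structural equals/hashCode, so instances built
// from the same field values act as one grouping key.
case class KeyData(a: Int, b: Int)

val rows = Seq((1, 1, "x"), (1, 2, "y"), (1, 1, "z"))
val grouped = rows.groupBy { case (a, b, _) => KeyData(a, b) }

println(grouped(KeyData(1, 1)).size) // 2: both (1, 1, _) rows share the key
```

The same property is what `df.groupByKey(r => KeyData(r.getInt(0), r.getInt(1)))` relies on in the Dataset API.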
[jira] [Commented] (SPARK-14642) import org.apache.spark.sql.expressions._ breaks udf under functions
[ https://issues.apache.org/jira/browse/SPARK-14642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15244899#comment-15244899 ] Yin Huai commented on SPARK-14642: -- [~sbcd90] Yea sure. I am not very sure about the right solution. But, having a PR can definitely help the discussion and help others better understand the problem :) > import org.apache.spark.sql.expressions._ breaks udf under functions > > > Key: SPARK-14642 > URL: https://issues.apache.org/jira/browse/SPARK-14642 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Priority: Critical > > The following code works > {code} > scala> import org.apache.spark.sql.functions._ > import org.apache.spark.sql.functions._ > scala> udf((v: String) => v.stripSuffix("-abc")) > res0: org.apache.spark.sql.expressions.UserDefinedFunction = > UserDefinedFunction(,StringType,Some(List(StringType))) > {code} > But, the following does not > {code} > scala> import org.apache.spark.sql.functions._ > import org.apache.spark.sql.functions._ > scala> import org.apache.spark.sql.expressions._ > import org.apache.spark.sql.expressions._ > scala> udf((v: String) => v.stripSuffix("-abc")) > :30: error: No TypeTag available for String >udf((v: String) => v.stripSuffix("-abc")) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14642) import org.apache.spark.sql.expressions._ breaks udf under functions
[ https://issues.apache.org/jira/browse/SPARK-14642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15244896#comment-15244896 ] Subhobrata Dey commented on SPARK-14642: Hello [~yhuai], I see that the issue gets resolved when the package {code:java} org.apache.spark.sql.expressions.scala {code} does not exist & the file {code:java} typed.scala {code} is put directly under the package {code:java} org.apache.spark.sql.expressions {code} in spark-sql_.jar Can I submit a PR for this? > import org.apache.spark.sql.expressions._ breaks udf under functions > > > Key: SPARK-14642 > URL: https://issues.apache.org/jira/browse/SPARK-14642 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Priority: Critical > > The following code works > {code} > scala> import org.apache.spark.sql.functions._ > import org.apache.spark.sql.functions._ > scala> udf((v: String) => v.stripSuffix("-abc")) > res0: org.apache.spark.sql.expressions.UserDefinedFunction = > UserDefinedFunction(,StringType,Some(List(StringType))) > {code} > But, the following does not > {code} > scala> import org.apache.spark.sql.functions._ > import org.apache.spark.sql.functions._ > scala> import org.apache.spark.sql.expressions._ > import org.apache.spark.sql.expressions._ > scala> udf((v: String) => v.stripSuffix("-abc")) > :30: error: No TypeTag available for String >udf((v: String) => v.stripSuffix("-abc")) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
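[Editor's note] The diagnosis above appears to be a name-shadowing problem: if a package `org.apache.spark.sql.expressions.scala` exists, the wildcard import brings a member named `scala` into scope, shadowing the root `scala` package, so `scala.reflect.runtime.universe.TypeTag` can no longer be resolved for the implicit `udf` machinery. The mechanism can be illustrated with an ordinary member named `scala` (purely illustrative, not the Spark code):

```scala
// Once a member named `scala` is in scope, bare references to `scala`
// resolve to it rather than to the root scala package.
object Shadow {
  val scala = "I shadow the root package"
  def resolved: String = scala // picks the val, not the package
}

println(Shadow.resolved) // I shadow the root package
```

Inside `Shadow`, a reference like `scala.reflect...` would now fail to compile for the same reason the wildcard import breaks `udf`.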
[jira] [Commented] (SPARK-13904) Add support for pluggable cluster manager
[ https://issues.apache.org/jira/browse/SPARK-13904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15244882#comment-15244882 ] Apache Spark commented on SPARK-13904: -- User 'tedyu' has created a pull request for this issue: https://github.com/apache/spark/pull/12457 > Add support for pluggable cluster manager > - > > Key: SPARK-13904 > URL: https://issues.apache.org/jira/browse/SPARK-13904 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Reporter: Hemant Bhanawat > > Currently Spark allows only a few cluster managers viz Yarn, Mesos and > Standalone. But, as Spark is now being used in newer and different use cases, > there is a need for allowing other cluster managers to manage spark > components. One such use case is - embedding spark components like executor > and driver inside another process which may be a datastore. This allows > colocation of data and processing. Another requirement that stems from such a > use case is that the executors/driver should not take the parent process down > when they go down and the components can be relaunched inside the same > process again. > So, this JIRA requests two functionalities: > 1. Support for external cluster managers > 2. Allow a cluster manager to clean up the tasks without taking the parent > process down. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14647) Group SQLContext/HiveContext state into PersistentState
[ https://issues.apache.org/jira/browse/SPARK-14647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15244880#comment-15244880 ] Yin Huai commented on SPARK-14647: -- Let me check the code and see if there is any suspicious place. > Group SQLContext/HiveContext state into PersistentState > --- > > Key: SPARK-14647 > URL: https://issues.apache.org/jira/browse/SPARK-14647 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Andrew Or > Fix For: 2.0.0 > > > This is analogous to SPARK-13526, which moved some things into > `SessionState`. After this issue we'll have an analogous `PersistentState` > that groups things to be shared across sessions. This will simplify the > constructors of the contexts significantly by allowing us to pass fewer > things into the contexts. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14647) Group SQLContext/HiveContext state into PersistentState
[ https://issues.apache.org/jira/browse/SPARK-14647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15244879#comment-15244879 ] Yin Huai commented on SPARK-14647: -- Looking at the log, it seems that it took a long time to resolve maven dependency (that test is specific to hive 0.13 metastore. So, it will first download jars using ivy). > Group SQLContext/HiveContext state into PersistentState > --- > > Key: SPARK-14647 > URL: https://issues.apache.org/jira/browse/SPARK-14647 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Andrew Or > Fix For: 2.0.0 > > > This is analogous to SPARK-13526, which moved some things into > `SessionState`. After this issue we'll have an analogous `PersistentState` > that groups things to be shared across sessions. This will simplify the > constructors of the contexts significantly by allowing us to pass fewer > things into the contexts. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14686) Implement a non-inheritable localProperty facility
[ https://issues.apache.org/jira/browse/SPARK-14686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15244807#comment-15244807 ] Apache Spark commented on SPARK-14686: -- User 'marcintustin' has created a pull request for this issue: https://github.com/apache/spark/pull/12456 > Implement a non-inheritable localProperty facility > -- > > Key: SPARK-14686 > URL: https://issues.apache.org/jira/browse/SPARK-14686 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Marcin Tustin >Priority: Minor > > As discussed here: > http://mail-archives.apache.org/mod_mbox/spark-dev/201604.mbox/%3CCANXtaKA4ZdpiUbZPnzDBN8ZL7_RwKSMuz6n45ixnHVEBE1hgjg%40mail.gmail.com%3E > Spark localProperties are always inherited by spawned threads. There are > situations in which this is undesirable (notably spark.sql.execution.id and > any other localProperty that should always be cleaned up). This is a ticket > to implement a non-inheritable mechanism for localProperties. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
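For background on the behaviour being changed: Spark keeps localProperties in a Java `InheritableThreadLocal`, which is why every spawned thread starts with a copy of its parent's properties. Below is a minimal pure-Python sketch of that inheritance and of an opt-out flag like the one this ticket proposes; the `LocalProperties` class, the `spawn` helper, and the `inheritable` flag are hypothetical illustrations, not Spark's API.

```python
import threading

class LocalProperties:
    """Toy sketch (hypothetical class, not Spark's API) of how Spark's
    localProperties behave: they live in a Java InheritableThreadLocal,
    so a spawned thread starts with a copy of its parent's properties.
    The `inheritable` flag models the opt-out this ticket proposes."""

    def __init__(self):
        self._local = threading.local()

    def set(self, key, value, inheritable=True):
        props = getattr(self._local, "props", {})
        props[key] = (value, inheritable)
        self._local.props = props

    def get(self, key):
        entry = getattr(self._local, "props", {}).get(key)
        return entry[0] if entry else None

    def spawn(self, target):
        # Snapshot only the inheritable entries on the parent thread;
        # a non-inheritable property is never copied to the child.
        snapshot = {k: v for k, v in getattr(self._local, "props", {}).items() if v[1]}
        def wrapper():
            self._local.props = dict(snapshot)
            target()
        t = threading.Thread(target=wrapper)
        t.start()
        return t

props = LocalProperties()
props.set("spark.jobGroup.id", "g1")                          # inherited, as today
props.set("spark.sql.execution.id", "42", inheritable=False)  # the proposed opt-out

seen = {}
t = props.spawn(lambda: seen.update(
    group=props.get("spark.jobGroup.id"),
    exec_id=props.get("spark.sql.execution.id")))
t.join()
assert seen == {"group": "g1", "exec_id": None}
```

With such a flag, a property like spark.sql.execution.id could be marked non-inheritable so a child thread never observes a stale value that should have been cleaned up.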
[jira] [Assigned] (SPARK-14686) Implement a non-inheritable localProperty facility
[ https://issues.apache.org/jira/browse/SPARK-14686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14686: Assignee: (was: Apache Spark) > Implement a non-inheritable localProperty facility > -- > > Key: SPARK-14686 > URL: https://issues.apache.org/jira/browse/SPARK-14686 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Marcin Tustin >Priority: Minor > > As discussed here: > http://mail-archives.apache.org/mod_mbox/spark-dev/201604.mbox/%3CCANXtaKA4ZdpiUbZPnzDBN8ZL7_RwKSMuz6n45ixnHVEBE1hgjg%40mail.gmail.com%3E > Spark localProperties are always inherited by spawned threads. There are > situations in which this is undesirable (notably spark.sql.execution.id and > any other localProperty that should always be cleaned up). This is a ticket > to implement a non-inheritable mechanism for localProperties. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14686) Implement a non-inheritable localProperty facility
[ https://issues.apache.org/jira/browse/SPARK-14686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14686: Assignee: Apache Spark > Implement a non-inheritable localProperty facility > -- > > Key: SPARK-14686 > URL: https://issues.apache.org/jira/browse/SPARK-14686 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Marcin Tustin >Assignee: Apache Spark >Priority: Minor > > As discussed here: > http://mail-archives.apache.org/mod_mbox/spark-dev/201604.mbox/%3CCANXtaKA4ZdpiUbZPnzDBN8ZL7_RwKSMuz6n45ixnHVEBE1hgjg%40mail.gmail.com%3E > Spark localProperties are always inherited by spawned threads. There are > situations in which this is undesirable (notably spark.sql.execution.id and > any other localProperty that should always be cleaned up). This is a ticket > to implement a non-inheritable mechanism for localProperties. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13904) Add support for pluggable cluster manager
[ https://issues.apache.org/jira/browse/SPARK-13904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15244805#comment-15244805 ] Kazuaki Ishizaki commented on SPARK-13904: -- Merging this PR may have begun causing test failures. Would it be possible to look at these links? https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6/627/ https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6/626/ cf. [SPARK-14690] > Add support for pluggable cluster manager > - > > Key: SPARK-13904 > URL: https://issues.apache.org/jira/browse/SPARK-13904 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Reporter: Hemant Bhanawat > > Currently Spark allows only a few cluster managers viz Yarn, Mesos and > Standalone. But, as Spark is now being used in newer and different use cases, > there is a need for allowing other cluster managers to manage spark > components. One such use case is - embedding spark components like executor > and driver inside another process which may be a datastore. This allows > colocation of data and processing. Another requirement that stems from such a > use case is that the executors/driver should not take the parent process down > when they go down and the components can be relaunched inside the same > process again. > So, this JIRA requests two functionalities: > 1. Support for external cluster managers > 2. Allow a cluster manager to clean up the tasks without taking the parent > process down. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-14690) [SQL] SPARK-8020 fails in Jenkins for master
[ https://issues.apache.org/jira/browse/SPARK-14690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki updated SPARK-14690: - Comment: was deleted (was: Add a link to the original JIRA) > [SQL] SPARK-8020 fails in Jenkins for master > > > Key: SPARK-14690 > URL: https://issues.apache.org/jira/browse/SPARK-14690 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Kazuaki Ishizaki > > After merging a PR for [SPARK-14672], [SPARK-13904], and another one, a test > "SPARK-8020" fails. > Here is a result at amplab Jenkins. > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6/627/ > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6/626/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-14690) [SQL] SPARK-8020 fails in Jenkins for master
[ https://issues.apache.org/jira/browse/SPARK-14690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki closed SPARK-14690. Add a link to the original JIRA > [SQL] SPARK-8020 fails in Jenkins for master > > > Key: SPARK-14690 > URL: https://issues.apache.org/jira/browse/SPARK-14690 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Kazuaki Ishizaki > > After merging a PR for [SPARK-14672], [SPARK-13904], and another one, a test > "SPARK-8020" fails. > Here is a result at amplab Jenkins. > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6/627/ > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6/626/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14690) [SQL] SPARK-8020 fails in Jenkins for master
[ https://issues.apache.org/jira/browse/SPARK-14690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15244798#comment-15244798 ] Kazuaki Ishizaki commented on SPARK-14690: -- I see. I will reopen the original JIRA soon. > [SQL] SPARK-8020 fails in Jenkins for master > > > Key: SPARK-14690 > URL: https://issues.apache.org/jira/browse/SPARK-14690 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Kazuaki Ishizaki > > After merging a PR for [SPARK-14672], [SPARK-13904], and another one, a test > "SPARK-8020" fails. > Here is a result at amplab Jenkins. > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6/627/ > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6/626/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14690) [SQL] SPARK-8020 fails in Jenkins for master
[ https://issues.apache.org/jira/browse/SPARK-14690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki resolved SPARK-14690. -- Resolution: Duplicate > [SQL] SPARK-8020 fails in Jenkins for master > > > Key: SPARK-14690 > URL: https://issues.apache.org/jira/browse/SPARK-14690 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Kazuaki Ishizaki > > After merging a PR for [SPARK-14672], [SPARK-13904], and another one, a test > "SPARK-8020" fails. > Here is a result at amplab Jenkins. > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6/627/ > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6/626/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14690) [SQL] SPARK-8020 fails in Jenkins for master
[ https://issues.apache.org/jira/browse/SPARK-14690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15244784#comment-15244784 ] Sean Owen commented on SPARK-14690: --- Same, please reopen the JIRA whose resolution you believe caused the failures. This splits the thread of discussion for anyone following the original change. > [SQL] SPARK-8020 fails in Jenkins for master > > > Key: SPARK-14690 > URL: https://issues.apache.org/jira/browse/SPARK-14690 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Kazuaki Ishizaki > > After merging a PR for [SPARK-14672], [SPARK-13904], and another one, a test > "SPARK-8020" fails. > Here is a result at amplab Jenkins. > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6/627/ > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6/626/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-14647) Group SQLContext/HiveContext state into PersistentState
[ https://issues.apache.org/jira/browse/SPARK-14647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reopened SPARK-14647: --- Pardon, does look like this may have begun causing test failures: https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6/625/ Cf. SPARK-14689 > Group SQLContext/HiveContext state into PersistentState > --- > > Key: SPARK-14647 > URL: https://issues.apache.org/jira/browse/SPARK-14647 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Andrew Or > Fix For: 2.0.0 > > > This is analogous to SPARK-13526, which moved some things into > `SessionState`. After this issue we'll have an analogous `PersistentState` > that groups things to be shared across sessions. This will simplify the > constructors of the contexts significantly by allowing us to pass fewer > things into the contexts. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14689) [SQL] SPARK-9757 fails in Jenkins for master
[ https://issues.apache.org/jira/browse/SPARK-14689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-14689. --- Resolution: Duplicate I think it doesn't help to make a new JIRA for this, as it splits the thread of discussion. I'm going to mark this a duplicate and reopen the other JIRA. > [SQL] SPARK-9757 fails in Jenkins for master > > > Key: SPARK-14689 > URL: https://issues.apache.org/jira/browse/SPARK-14689 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Kazuaki Ishizaki >Priority: Blocker > > After merging a PR for [SPARK-14647], a test "SPARK-9757" fails. > Here is a result at amplab Jenkins. > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6/625/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14690) [SQL] SPARK-8020 fails in Jenkins for master
[ https://issues.apache.org/jira/browse/SPARK-14690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki updated SPARK-14690: - Summary: [SQL] SPARK-8020 fails in Jenkins for master (was: [SQL] SPARK-9757 fails in Jenkins for master) > [SQL] SPARK-8020 fails in Jenkins for master > > > Key: SPARK-14690 > URL: https://issues.apache.org/jira/browse/SPARK-14690 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Kazuaki Ishizaki > > After merging a PR for [SPARK-14672], [SPARK-13904], and another one, a test > "SPARK-8020" fails. > Here is a result at amplab Jenkins. > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6/627/ > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6/626/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14690) [SQL] SPARK-9757 fails in Jenkins for master
Kazuaki Ishizaki created SPARK-14690: Summary: [SQL] SPARK-9757 fails in Jenkins for master Key: SPARK-14690 URL: https://issues.apache.org/jira/browse/SPARK-14690 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Kazuaki Ishizaki After merging a PR for [SPARK-14672], [SPARK-13904], and another one, a test "SPARK-8020" fails. Here is a result at amplab Jenkins. https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6/627/ https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6/626/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14689) [SQL] SPARK-9757 fails in Jenkins for master
Kazuaki Ishizaki created SPARK-14689: Summary: [SQL] SPARK-9757 fails in Jenkins for master Key: SPARK-14689 URL: https://issues.apache.org/jira/browse/SPARK-14689 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Kazuaki Ishizaki Priority: Blocker After merging a PR for [SPARK-14647], a test "SPARK-9757" fails. Here is a result at amplab Jenkins. https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6/625/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14688) pyspark textFileStream gzipped
[ https://issues.apache.org/jira/browse/SPARK-14688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15244770#comment-15244770 ] Sean Owen commented on SPARK-14688: --- Could you provide some detail? AFAIK it's just delegating to the same Hadoop APIs to read, right? > pyspark textFileStream gzipped > -- > > Key: SPARK-14688 > URL: https://issues.apache.org/jira/browse/SPARK-14688 > Project: Spark > Issue Type: Improvement > Components: PySpark, Streaming >Affects Versions: 1.6.1 >Reporter: seth > Labels: pyspark, streaming > > pyspark's textFileStream does not support reading gzip files. > Two notes: > 1. the regular sparkContext does support gzip files > 2. the Java/Scala methods do support streaming gzip files -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14688) pyspark textFileStream gzipped
seth created SPARK-14688: Summary: pyspark textFileStream gzipped Key: SPARK-14688 URL: https://issues.apache.org/jira/browse/SPARK-14688 Project: Spark Issue Type: Improvement Components: PySpark, Streaming Affects Versions: 1.6.1 Reporter: seth pyspark's textFileStream does not support reading gzip files. Two notes: 1. the regular sparkContext does support gzip files 2. the Java/Scala methods do support streaming gzip files -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
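For comparison, the transparent handling the reporter already gets from the batch path: Hadoop's input formats select a decompression codec from the file extension, which is why `sc.textFile()` reads `.gz` files for free. A pure-Python sketch of that extension-based dispatch (no Spark involved; the file name is made up for illustration):

```python
import gzip
import os
import tempfile

# Write a gzipped text file, then read it back transparently.
# sc.textFile() gets this behaviour from Hadoop's compression codecs;
# the request here is for textFileStream() to behave the same way.
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "events.txt.gz")
    with gzip.open(path, "wt") as f:
        f.write("line one\nline two\n")

    # Dispatch on the file extension, as Hadoop does when selecting a codec.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as f:
        lines = f.read().splitlines()

assert lines == ["line one", "line two"]
```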
[jira] [Assigned] (SPARK-14685) Properly document heritability of localProperties
[ https://issues.apache.org/jira/browse/SPARK-14685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14685: Assignee: Apache Spark > Properly document heritability of localProperties > - > > Key: SPARK-14685 > URL: https://issues.apache.org/jira/browse/SPARK-14685 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Marcin Tustin >Assignee: Apache Spark >Priority: Minor > > As discussed here: > http://mail-archives.apache.org/mod_mbox/spark-dev/201604.mbox/%3CCANXtaKA4ZdpiUbZPnzDBN8ZL7_RwKSMuz6n45ixnHVEBE1hgjg%40mail.gmail.com%3E > One thread spawned by another will inherit spark localProperties. This is not > currently documented, and there are no tests for that specific behaviour. > This is a ticket to document this behaviour, including its consequences, and > implement an appropriate test. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14685) Properly document heritability of localProperties
[ https://issues.apache.org/jira/browse/SPARK-14685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15244692#comment-15244692 ] Apache Spark commented on SPARK-14685: -- User 'marcintustin' has created a pull request for this issue: https://github.com/apache/spark/pull/12455 > Properly document heritability of localProperties > - > > Key: SPARK-14685 > URL: https://issues.apache.org/jira/browse/SPARK-14685 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Marcin Tustin >Priority: Minor > > As discussed here: > http://mail-archives.apache.org/mod_mbox/spark-dev/201604.mbox/%3CCANXtaKA4ZdpiUbZPnzDBN8ZL7_RwKSMuz6n45ixnHVEBE1hgjg%40mail.gmail.com%3E > One thread spawned by another will inherit spark localProperties. This is not > currently documented, and there are no tests for that specific behaviour. > This is a ticket to document this behaviour, including its consequences, and > implement an appropriate test. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14685) Properly document heritability of localProperties
[ https://issues.apache.org/jira/browse/SPARK-14685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14685: Assignee: (was: Apache Spark) > Properly document heritability of localProperties > - > > Key: SPARK-14685 > URL: https://issues.apache.org/jira/browse/SPARK-14685 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Marcin Tustin >Priority: Minor > > As discussed here: > http://mail-archives.apache.org/mod_mbox/spark-dev/201604.mbox/%3CCANXtaKA4ZdpiUbZPnzDBN8ZL7_RwKSMuz6n45ixnHVEBE1hgjg%40mail.gmail.com%3E > One thread spawned by another will inherit spark localProperties. This is not > currently documented, and there are no tests for that specific behaviour. > This is a ticket to document this behaviour, including its consequences, and > implement an appropriate test. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14343) Dataframe operations on a partitioned dataset (using partition discovery) return invalid results
[ https://issues.apache.org/jira/browse/SPARK-14343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15244669#comment-15244669 ] Jurriaan Pruis commented on SPARK-14343: On the spark 2.0.0 nightly build it doesn't work at all: {code:none} >>> df=sqlContext.read.text('dataset') 16/04/17 16:11:34 INFO HDFSFileCatalog: Listing file:/Users/.../dataset on driver 16/04/17 16:11:34 INFO HDFSFileCatalog: Listing file:/Users/.../dataset/year=2014 on driver 16/04/17 16:11:34 INFO HDFSFileCatalog: Listing file:/Users/.../dataset/year=2015 on driver Traceback (most recent call last): File "", line 1, in File "/Users/.../Downloads/spark-2.0.0-SNAPSHOT-bin-hadoop2.7/python/pyspark/sql/readwriter.py", line 245, in text return self._df(self._jreader.text(self._sqlContext._sc._jvm.PythonUtils.toSeq(paths))) File "/Users/.../Downloads/spark-2.0.0-SNAPSHOT-bin-hadoop2.7/python/lib/py4j-0.9.2-src.zip/py4j/java_gateway.py", line 836, in __call__ File "/Users/.../Downloads/spark-2.0.0-SNAPSHOT-bin-hadoop2.7/python/pyspark/sql/utils.py", line 57, in deco raise AnalysisException(s.split(': ', 1)[1], stackTrace) pyspark.sql.utils.AnalysisException: u'Try to map struct to Tuple1, but failed as the number of fields does not line up.\n - Input schema: struct\n - Target schema: struct;' {code} > Dataframe operations on a partitioned dataset (using partition discovery) > return invalid results > > > Key: SPARK-14343 > URL: https://issues.apache.org/jira/browse/SPARK-14343 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 2.0.0 > Environment: Mac OS X 10.11.4 >Reporter: Jurriaan Pruis > > When reading a dataset using {{sqlContext.read.text()}} queries on the > partitioned column return invalid results. > h2. How to reproduce: > h3. Generate datasets > {code:title=repro.sh} > #!/bin/sh > mkdir -p dataset/year=2014 > mkdir -p dataset/year=2015 > echo "data from 2014" > dataset/year=2014/part01.txt > echo "data from 2015" > dataset/year=2015/part01.txt > {code} > {code:title=repro2.sh} > #!/bin/sh > mkdir -p dataset2/month=june > mkdir -p dataset2/month=july > echo "data from june" > dataset2/month=june/part01.txt > echo "data from july" > dataset2/month=july/part01.txt > {code} > h3. using first dataset > {code:none} > >>> df = sqlContext.read.text('dataset') > ... > >>> df > DataFrame[value: string, year: int] > >>> df.show() > +--++ > | value|year| > +--++ > |data from 2014|2014| > |data from 2015|2015| > +--++ > >>> df.select('year').show() > ++ > |year| > ++ > | 14| > | 14| > ++ > {code} > This is clearly wrong. Seems like it returns the length of the value column? > h3. using second dataset > With another dataset it looks like this: > {code:none} > >>> df = sqlContext.read.text('dataset2') > >>> df > DataFrame[value: string, month: string] > >>> df.show() > +--+-+ > | value|month| > +--+-+ > |data from june| june| > |data from july| july| > +--+-+ > >>> df.select('month').show() > +--+ > | month| > +--+ > |data from june| > |data from july| > +--+ > {code} > Here it returns the value of the value column instead of the month partition. > h3. Workaround > When I convert the dataframe to an RDD and back to a DataFrame I get the > following result (which is the expected behaviour): > {code:none} > >>> df.rdd.toDF().select('month').show() > +-+ > |month| > +-+ > | june| > | july| > +-+ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14343) Dataframe operations on a partitioned dataset (using partition discovery) return invalid results
[ https://issues.apache.org/jira/browse/SPARK-14343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jurriaan Pruis updated SPARK-14343: --- Affects Version/s: 2.0.0 > Dataframe operations on a partitioned dataset (using partition discovery) > return invalid results > > > Key: SPARK-14343 > URL: https://issues.apache.org/jira/browse/SPARK-14343 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 2.0.0 > Environment: Mac OS X 10.11.4 >Reporter: Jurriaan Pruis > > When reading a dataset using {{sqlContext.read.text()}} queries on the > partitioned column return invalid results. > h2. How to reproduce: > h3. Generate datasets > {code:title=repro.sh} > #!/bin/sh > mkdir -p dataset/year=2014 > mkdir -p dataset/year=2015 > echo "data from 2014" > dataset/year=2014/part01.txt > echo "data from 2015" > dataset/year=2015/part01.txt > {code} > {code:title=repro2.sh} > #!/bin/sh > mkdir -p dataset2/month=june > mkdir -p dataset2/month=july > echo "data from june" > dataset2/month=june/part01.txt > echo "data from july" > dataset2/month=july/part01.txt > {code} > h3. using first dataset > {code:none} > >>> df = sqlContext.read.text('dataset') > ... > >>> df > DataFrame[value: string, year: int] > >>> df.show() > +--++ > | value|year| > +--++ > |data from 2014|2014| > |data from 2015|2015| > +--++ > >>> df.select('year').show() > ++ > |year| > ++ > | 14| > | 14| > ++ > {code} > This is clearly wrong. Seems like it returns the length of the value column? > h3. using second dataset > With another dataset it looks like this: > {code:none} > >>> df = sqlContext.read.text('dataset2') > >>> df > DataFrame[value: string, month: string] > >>> df.show() > +--+-+ > | value|month| > +--+-+ > |data from june| june| > |data from july| july| > +--+-+ > >>> df.select('month').show() > +--+ > | month| > +--+ > |data from june| > |data from july| > +--+ > {code} > Here it returns the value of the value column instead of the month partition. > h3. Workaround > When I convert the dataframe to an RDD and back to a DataFrame I get the > following result (which is the expected behaviour): > {code:none} > >>> df.rdd.toDF().select('month').show() > +-+ > |month| > +-+ > | june| > | july| > +-+ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13753) Column nullable is derived incorrectly
[ https://issues.apache.org/jira/browse/SPARK-13753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15244640#comment-15244640 ] Takeshi Yamamuro commented on SPARK-13753: -- Could you also post the explain result of your query? > Column nullable is derived incorrectly > -- > > Key: SPARK-13753 > URL: https://issues.apache.org/jira/browse/SPARK-13753 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 >Reporter: Jingwei Lu >Priority: Critical > > There is a problem in Spark SQL where a column's nullability is derived > incorrectly and then used in optimization. In the following query: > {code} > select concat("perf.realtime.web", b.tags[1]) as metric, b.value, b.tags[0] > from ( > select explode(map(a.frontend[0], > ARRAY(concat("metric:frontend", ",controller:", COALESCE(controller, "null"), > ",action:", COALESCE(action, "null")), ".p50"), > a.frontend[1], > ARRAY(concat("metric:frontend", ",controller:", COALESCE(controller, "null"), > ",action:", COALESCE(action, "null")), ".p90"), > a.backend[0], ARRAY(concat("metric:backend", > ",controller:", COALESCE(controller, "null"), ",action:", COALESCE(action, > "null")), ".p50"), > a.backend[1], ARRAY(concat("metric:backend", > ",controller:", COALESCE(controller, "null"), ",action:", COALESCE(action, > "null")), ".p90"), > a.render[0], ARRAY(concat("metric:render", > ",controller:", COALESCE(controller, "null"), ",action:", COALESCE(action, > "null")), ".p50"), > a.render[1], ARRAY(concat("metric:render", > ",controller:", COALESCE(controller, "null"), ",action:", COALESCE(action, > "null")), ".p90"), > a.page_load_time[0], > ARRAY(concat("metric:page_load_time", ",controller:", COALESCE(controller, > "null"), ",action:", COALESCE(action, "null")), ".p50"), > a.page_load_time[1], > ARRAY(concat("metric:page_load_time", ",controller:", COALESCE(controller, > "null"), ",action:", COALESCE(action, "null")), ".p90"), > a.total_load_time[0], > ARRAY(concat("metric:total_load_time", ",controller:", COALESCE(controller, > "null"), ",action:", COALESCE(action, "null")), ".p50"), > a.total_load_time[1], > ARRAY(concat("metric:total_load_time", ",controller:", COALESCE(controller, > "null"), ",action:", COALESCE(action, "null")), ".p90"))) as (value, tags) > from ( > select data.controller as controller, data.action as > action, > percentile(data.frontend, array(0.5, 0.9)) as > frontend, > percentile(data.backend, array(0.5, 0.9)) as > backend, > percentile(data.render, array(0.5, 0.9)) as render, > percentile(data.page_load_time, array(0.5, 0.9)) as > page_load_time, > percentile(data.total_load_time, array(0.5, 0.9)) > as total_load_time > from air_events_rt > where type='air_events' and data.event_name='pageload' > group by data.controller, data.action > ) a > ) b > where b.value is not null > {code} > b.value is incorrectly derived as not nullable. The "b.value is not null" > predicate will be ignored by the optimizer, which causes the query to return > incorrect results. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
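The failure mode reported above can be illustrated with a toy optimizer rule (this is a sketch of the mechanism, not Spark's actual Catalyst code): if the schema wrongly declares a column non-nullable, an optimizer that prunes `IS NOT NULL` filters on non-nullable columns silently keeps NULL rows.

```python
# Toy illustration of the SPARK-13753 failure mode (not Spark's
# actual optimizer): pruning an IS NOT NULL filter based on a
# wrongly-derived "non-nullable" schema keeps NULL rows in the result.

def optimize_filter(predicate_col, schema_nullable):
    """Return None when the filter can be pruned (column declared non-null),
    otherwise a row predicate implementing IS NOT NULL."""
    if not schema_nullable:
        return None  # optimizer drops the filter as always-true
    return lambda row: row[predicate_col] is not None

rows = [{"value": 1}, {"value": None}]

# Correct schema: value is nullable -> filter kept, NULL row removed.
f = optimize_filter("value", schema_nullable=True)
kept = [r for r in rows if f is None or f(r)]
assert kept == [{"value": 1}]

# Incorrectly derived schema: value declared non-nullable -> filter
# pruned, so the NULL row wrongly survives (the bug reported here).
f = optimize_filter("value", schema_nullable=False)
kept_buggy = [r for r in rows if f is None or f(r)]
assert kept_buggy == rows
```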
[jira] [Commented] (SPARK-14635) Documentation and Examples for TF-IDF only refer to HashingTF
[ https://issues.apache.org/jira/browse/SPARK-14635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15244575#comment-15244575 ] Apache Spark commented on SPARK-14635: -- User 'hhbyyh' has created a pull request for this issue: https://github.com/apache/spark/pull/12454 > Documentation and Examples for TF-IDF only refer to HashingTF > - > > Key: SPARK-14635 > URL: https://issues.apache.org/jira/browse/SPARK-14635 > Project: Spark > Issue Type: Documentation > Components: ML >Reporter: Nick Pentreath >Priority: Minor > > Currently, the [docs for > TF-IDF|http://spark.apache.org/docs/latest/ml-features.html#tf-idf-hashingtf-and-idf] > only refer to using {{HashingTF}} with {{IDF}}. However, {{CountVectorizer}} > can also be used. We should probably amend the user guide and examples to > show this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14635) Documentation and Examples for TF-IDF only refer to HashingTF
[ https://issues.apache.org/jira/browse/SPARK-14635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14635: Assignee: Apache Spark > Documentation and Examples for TF-IDF only refer to HashingTF > - > > Key: SPARK-14635 > URL: https://issues.apache.org/jira/browse/SPARK-14635 > Project: Spark > Issue Type: Documentation > Components: ML >Reporter: Nick Pentreath >Assignee: Apache Spark >Priority: Minor > > Currently, the [docs for > TF-IDF|http://spark.apache.org/docs/latest/ml-features.html#tf-idf-hashingtf-and-idf] > only refer to using {{HashingTF}} with {{IDF}}. However, {{CountVectorizer}} > can also be used. We should probably amend the user guide and examples to > show this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14635) Documentation and Examples for TF-IDF only refer to HashingTF
[ https://issues.apache.org/jira/browse/SPARK-14635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14635:

Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-14681) Provide label/impurity stats for spark.ml decision tree nodes
[ https://issues.apache.org/jira/browse/SPARK-14681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15244574#comment-15244574 ] zhengruifeng commented on SPARK-14681:
--
Will these stats be included in the trainingSummary, or in a non-training summary evaluated on some DataFrame?

> Provide label/impurity stats for spark.ml decision tree nodes
> -
>
> Key: SPARK-14681
> URL: https://issues.apache.org/jira/browse/SPARK-14681
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Reporter: Joseph K. Bradley
>
> Currently, spark.ml decision trees provide all node info except for the aggregated stats about labels and impurities. This task is to provide those publicly. We need to choose a good API for it, so we should discuss the design on this issue before implementing it.
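For reference, a sketch of what the public node API already exposes (it assumes a fitted {{DecisionTreeClassificationModel}} named {{model}} is in scope): each node gives its prediction and a scalar impurity, but not the per-label counts behind that impurity, which is what this issue proposes to surface.

{code}
import org.apache.spark.ml.tree.{InternalNode, LeafNode, Node}

// Walk the tree and print what each node currently exposes publicly.
def describe(node: Node, depth: Int = 0): Unit = node match {
  case n: InternalNode =>
    println(s"${"  " * depth}split on feature ${n.split.featureIndex}, impurity=${n.impurity}")
    describe(n.leftChild, depth + 1)
    describe(n.rightChild, depth + 1)
  case l: LeafNode =>
    println(s"${"  " * depth}predict=${l.prediction}, impurity=${l.impurity}")
}

describe(model.rootNode)
{code}

The design question in the issue is whether the aggregated label counts would hang off these {{Node}} objects or off a summary object on the model.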
[jira] [Assigned] (SPARK-14283) Avoid sort in randomSplit when possible
[ https://issues.apache.org/jira/browse/SPARK-14283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14283:

Assignee: Apache Spark

> Avoid sort in randomSplit when possible
> -
>
> Key: SPARK-14283
> URL: https://issues.apache.org/jira/browse/SPARK-14283
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Reporter: Joseph K. Bradley
> Assignee: Apache Spark
>
> Dataset.randomSplit sorts each partition in order to guarantee an ordering and make randomSplit deterministic given the seed. Since randomSplit is used a fair amount in ML, it would be great to avoid the sort when possible. Are there cases when it could be avoided?
[jira] [Commented] (SPARK-14283) Avoid sort in randomSplit when possible
[ https://issues.apache.org/jira/browse/SPARK-14283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15244571#comment-15244571 ] Apache Spark commented on SPARK-14283:
--
User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/12453
[jira] [Assigned] (SPARK-14283) Avoid sort in randomSplit when possible
[ https://issues.apache.org/jira/browse/SPARK-14283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14283:

Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-14283) Avoid sort in randomSplit when possible
[ https://issues.apache.org/jira/browse/SPARK-14283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15244568#comment-15244568 ] zhengruifeng commented on SPARK-14283:
--
[~josephkb] I can work on this. There should be a variant of randomSplit that avoids the local sort, which is meaningless in ML. IMO the calls in ML should pass an extra parameter to skip the local sort.
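For context, the call under discussion looks like this (a sketch, assuming a DataFrame {{df}} is in scope); the per-partition sort is what makes two runs with the same seed return identical splits:

{code}
// Determinism today comes from sorting each partition before sampling:
// running this twice with seed 42 yields the same train/test rows.
val Array(train, test) = df.randomSplit(Array(0.8, 0.2), seed = 42L)

// The opt-out suggested in the comment might look like an extra
// parameter; note this flag is hypothetical and does not exist in the
// current API:
// val splits = df.randomSplit(Array(0.8, 0.2), seed = 42L, sortPartitions = false)
{code}

The trade-off the issue asks about: skipping the sort saves a pass over each partition but ties the split's reproducibility to partition-internal row order.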
[jira] [Commented] (SPARK-13363) Aggregator not working with DataFrame
[ https://issues.apache.org/jira/browse/SPARK-13363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15244566#comment-15244566 ] Apache Spark commented on SPARK-13363:
--
User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/12451

> Aggregator not working with DataFrame
> -
>
> Key: SPARK-13363
> URL: https://issues.apache.org/jira/browse/SPARK-13363
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.6.0
> Reporter: koert kuipers
> Assignee: Wenchen Fan
> Priority: Blocker
> Fix For: 2.0.0
>
> The doc for org.apache.spark.sql.expressions.Aggregator says it is a base class for user-defined aggregations, which can be used in [[DataFrame]] and [[Dataset]].
> It works well with Dataset/GroupedDataset, but I am having no luck using it with DataFrame/GroupedData. Does anyone have an example of how to use it with a DataFrame?
> In particular, I would like to use it with this method in GroupedData:
> {noformat}
> def agg(expr: Column, exprs: Column*): DataFrame
> {noformat}
> Clearly it should be possible, since GroupedDataset uses that very same method to do the work:
> {noformat}
> private def agg(exprs: Column*): DataFrame =
>   groupedData.agg(withEncoder(exprs.head), exprs.tail.map(withEncoder): _*)
> {noformat}
> The trick seems to be the wrapping in withEncoder, which is private. I tried to do something like it myself, but I had no luck since it uses more private stuff in TypedColumn.
> Anyhow, my attempt at using it in a DataFrame:
> {noformat}
> val simpleSum = new Aggregator[Int, Int, Int] {
>   def zero: Int = 0                     // The initial value.
>   def reduce(b: Int, a: Int) = b + a    // Add an element to the running total.
>   def merge(b1: Int, b2: Int) = b1 + b2 // Merge intermediate values.
>   def finish(b: Int) = b                // Return the final result.
> }.toColumn
>
> val df = sc.makeRDD(1 to 3).map(i => (i, i)).toDF("k", "v")
> df.groupBy("k").agg(simpleSum).show
> {noformat}
> And the resulting error:
> {noformat}
> org.apache.spark.sql.AnalysisException: unresolved operator 'Aggregate [k#104], [k#104,($anon$3(),mode=Complete,isDistinct=false) AS sum#106];
> at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
> at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:46)
> at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:241)
> at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50)
> at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:122)
> at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:50)
> at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:46)
> at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
> at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:130)
> at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:49)
> {noformat}
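Since the typed path is the one the reporter says does work, a hedged sketch of that route against the 1.6 API may be useful for comparison (assumptions: a {{sc}} and {{sqlContext}} in scope; note that {{groupBy(func)}} was renamed {{groupByKey}} in 2.0, where {{Aggregator}} also requires explicit {{bufferEncoder}}/{{outputEncoder}} definitions):

{code}
import org.apache.spark.sql.expressions.Aggregator
import sqlContext.implicits._  // assumed in scope, for encoders and toDS

// The same sum as above, but typed over the full (k, v) pair so it can
// be used directly with the typed Dataset groupBy/agg API.
val typedSum = new Aggregator[(Int, Int), Int, Int] {
  def zero: Int = 0
  def reduce(b: Int, a: (Int, Int)): Int = b + a._2
  def merge(b1: Int, b2: Int): Int = b1 + b2
  def finish(b: Int): Int = b
}.toColumn

val ds = sc.makeRDD(1 to 3).map(i => (i, i)).toDS()
ds.groupBy(_._1).agg(typedSum).show()
{code}

The difference from the failing snippet above is that {{agg}} here receives a {{TypedColumn}} on a {{GroupedDataset}}, so the private {{withEncoder}} wrapping is applied for you.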