[jira] [Resolved] (SPARK-3455) **HotFix** Unit test failed due to can not resolve the attribute references
[ https://issues.apache.org/jira/browse/SPARK-3455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-3455. - Resolution: Fixed **HotFix** Unit test failed due to can not resolve the attribute references --- Key: SPARK-3455 URL: https://issues.apache.org/jira/browse/SPARK-3455 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Priority: Blocker The test case SPARK-3349 partitioning after limit failed, the exception as : {panel} 23:10:04.117 ERROR org.apache.spark.scheduler.TaskSetManager: Task 0 in stage 274.0 failed 1 times; aborting job [info] - SPARK-3349 partitioning after limit *** FAILED *** [info] Exception thrown while executing query: [info] == Parsed Logical Plan == [info] Project [*] [info]Join Inner, Some(('subset1.n = 'lowerCaseData.n)) [info] UnresolvedRelation None, lowerCaseData, None [info] UnresolvedRelation None, subset1, None [info] [info] == Analyzed Logical Plan == [info] Project [n#605,l#606,n#12] [info]Join Inner, Some((n#12 = n#605)) [info] SparkLogicalPlan (ExistingRdd [n#605,l#606], MapPartitionsRDD[13] at mapPartitions at basicOperators.scala:219) [info] Limit 2 [info] Sort [n#12 DESC] [info] Distinct [info]Project [n#12] [info] SparkLogicalPlan (ExistingRdd [n#607,l#608], MapPartitionsRDD[13] at mapPartitions at basicOperators.scala:219) [info] [info] == Optimized Logical Plan == [info] Project [n#605,l#606,n#12] [info]Join Inner, Some((n#12 = n#605)) [info] SparkLogicalPlan (ExistingRdd [n#605,l#606], MapPartitionsRDD[13] at mapPartitions at basicOperators.scala:219) [info] Limit 2 [info] Sort [n#12 DESC] [info] Distinct [info]Project [n#12] [info] SparkLogicalPlan (ExistingRdd [n#607,l#608], MapPartitionsRDD[13] at mapPartitions at basicOperators.scala:219) [info] [info] == Physical Plan == [info] Project [n#605,l#606,n#12] [info]ShuffledHashJoin [n#605], [n#12], BuildRight [info] Exchange (HashPartitioning [n#605], 10) [info] ExistingRdd [n#605,l#606], MapPartitionsRDD[13] at mapPartitions at basicOperators.scala:219 [info] Exchange (HashPartitioning [n#12], 10) [info] TakeOrdered 2, [n#12 DESC] [info] Distinct false [info]Exchange (HashPartitioning [n#12], 10) [info] Distinct true [info] Project [n#12] [info] ExistingRdd [n#607,l#608], MapPartitionsRDD[13] at mapPartitions at basicOperators.scala:219 [info] [info] Code Generation: false [info] == RDD == [info] == Exception == [info] org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree: [info] Exchange (HashPartitioning [n#12], 10) [info]TakeOrdered 2, [n#12 DESC] [info] Distinct false [info] Exchange (HashPartitioning [n#12], 10) [info] Distinct true [info]Project [n#12] [info] ExistingRdd [n#607,l#608], MapPartitionsRDD[13] at mapPartitions at basicOperators.scala:219 [info] [info] org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree: [info] Exchange (HashPartitioning [n#12], 10) [info]TakeOrdered 2, [n#12 DESC] [info] Distinct false [info] Exchange (HashPartitioning [n#12], 10) [info] Distinct true [info]Project [n#12] [info] ExistingRdd [n#607,l#608], MapPartitionsRDD[13] at mapPartitions at basicOperators.scala:219 [info] [info]at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47) [info]at org.apache.spark.sql.execution.Exchange.execute(Exchange.scala:44) [info]at org.apache.spark.sql.execution.ShuffledHashJoin.execute(joins.scala:354) [info]at org.apache.spark.sql.execution.Project.execute(basicOperators.scala:42) [info]at 
org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:85) [info]at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:438) [info]at org.apache.spark.sql.QueryTest.checkAnswer(QueryTest.scala:40) [info]at org.apache.spark.sql.SQLQuerySuite$$anonfun$31.apply$mcV$sp(SQLQuerySuite.scala:369) [info]at org.apache.spark.sql.SQLQuerySuite$$anonfun$31.apply(SQLQuerySuite.scala:362) [info]at org.apache.spark.sql.SQLQuerySuite$$anonfun$31.apply(SQLQuerySuite.scala:362) [info]at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) [info]at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) [info]at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) [info]at
[jira] [Updated] (SPARK-2883) Spark Support for ORCFile format
[ https://issues.apache.org/jira/browse/SPARK-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2883: Priority: Blocker (was: Major) Spark Support for ORCFile format Key: SPARK-2883 URL: https://issues.apache.org/jira/browse/SPARK-2883 Project: Spark Issue Type: Bug Components: Input/Output, SQL Reporter: Zhan Zhang Priority: Blocker Attachments: 2014-09-12 07.05.24 pm Spark UI.png, 2014-09-12 07.07.19 pm jobtracker.png Verify the support of OrcInputFormat in Spark, fix any issues that exist, and add documentation for its usage. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
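For context on what verifying OrcInputFormat support involves, the following is a minimal sketch of reading ORC files through the generic Hadoop InputFormat API, roughly what users must do today before native support lands. The input path is a placeholder; schema handling and error handling are omitted.
{code}
import org.apache.hadoop.hive.ql.io.orc.{OrcInputFormat, OrcStruct}
import org.apache.hadoop.io.NullWritable
import org.apache.spark.{SparkConf, SparkContext}

object OrcReadSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("orc-read-sketch"))

    // Read ORC files via the mapred OrcInputFormat; each record value is an OrcStruct.
    val orc = sc.hadoopFile(
      "hdfs:///path/to/orc/table",   // placeholder path
      classOf[OrcInputFormat],
      classOf[NullWritable],
      classOf[OrcStruct])

    println(orc.count())                                    // sanity check: row count
    orc.take(5).foreach { case (_, row) => println(row) }   // OrcStruct.toString shows the fields
    sc.stop()
  }
}
{code}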
[jira] [Updated] (SPARK-2883) Spark Support for ORCFile format
[ https://issues.apache.org/jira/browse/SPARK-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2883: Target Version/s: 1.2.0 Spark Support for ORCFile format Key: SPARK-2883 URL: https://issues.apache.org/jira/browse/SPARK-2883 Project: Spark Issue Type: Bug Components: Input/Output, SQL Reporter: Zhan Zhang Priority: Blocker Attachments: 2014-09-12 07.05.24 pm Spark UI.png, 2014-09-12 07.07.19 pm jobtracker.png Verify the support of OrcInputFormat in Spark, fix any issues that exist, and add documentation for its usage. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3481) HiveComparisonTest throws exception of org.apache.hadoop.hive.ql.metadata.HiveException: Database does not exist: default
[ https://issues.apache.org/jira/browse/SPARK-3481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132567#comment-14132567 ] Apache Spark commented on SPARK-3481: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/2377 HiveComparisonTest throws exception of org.apache.hadoop.hive.ql.metadata.HiveException: Database does not exist: default --- Key: SPARK-3481 URL: https://issues.apache.org/jira/browse/SPARK-3481 Project: Spark Issue Type: Test Components: SQL Reporter: Cheng Hao Priority: Minor In local test, lots of exception raised like: {panel} 11:08:01.746 ERROR hive.ql.exec.DDLTask: org.apache.hadoop.hive.ql.metadata.HiveException: Database does not exist: default at org.apache.hadoop.hive.ql.exec.DDLTask.switchDatabase(DDLTask.java:3480) at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:237) at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:151) at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:65) at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1414) at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1192) at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1020) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888) at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:298) at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:272) at org.apache.spark.sql.hive.test.TestHiveContext.runSqlHive(TestHive.scala:88) at org.apache.spark.sql.hive.test.TestHiveContext.reset(TestHive.scala:348) at org.apache.spark.sql.hive.execution.HiveComparisonTest$$anonfun$createQueryTest$1.apply$mcV$sp(HiveComparisonTest.scala:255) at org.apache.spark.sql.hive.execution.HiveComparisonTest$$anonfun$createQueryTest$1.apply(HiveComparisonTest.scala:225) at org.apache.spark.sql.hive.execution.HiveComparisonTest$$anonfun$createQueryTest$1.apply(HiveComparisonTest.scala:225) at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:158) at org.scalatest.Suite$class.withFixture(Suite.scala:1121) at org.scalatest.FunSuite.withFixture(FunSuite.scala:1559) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:155) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:167) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:167) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:167) at org.scalatest.FunSuite.runTest(FunSuite.scala:1559) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:200) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:200) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) at scala.collection.immutable.List.foreach(List.scala:318) at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) at 
org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:200) at org.scalatest.FunSuite.runTests(FunSuite.scala:1559) at org.scalatest.Suite$class.run(Suite.scala:1423) at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1559) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:204) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:204) at org.scalatest.SuperEngine.runImpl(Engine.scala:545) at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:204) at org.apache.spark.sql.hive.execution.HiveComparisonTest.org$scalatest$BeforeAndAfterAll$$super$run(HiveComparisonTest.scala:41) at org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257) at
[jira] [Commented] (SPARK-3491) Use pickle to serialize the data in MLlib Python
[ https://issues.apache.org/jira/browse/SPARK-3491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132590#comment-14132590 ] Apache Spark commented on SPARK-3491: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/2378 Use pickle to serialize the data in MLlib Python Key: SPARK-3491 URL: https://issues.apache.org/jira/browse/SPARK-3491 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Reporter: Davies Liu Assignee: Davies Liu Currently, we write the serialization/deserialization code in Python and Scala manually; this cannot scale to the large number of MLlib APIs. If the serialization could be done with pickle (using Pyrolite in the JVM) in an extensible way, it would be much easier to add Python APIs for MLlib. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
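As a rough illustration of the proposal (not MLlib's actual implementation), Pyrolite lets the JVM side produce and consume Python pickle bytes directly, so each new MLlib type would not need hand-written converters on both sides. The sketch below only shows the basic round trip; real usage would register custom picklers for MLlib's vector and matrix types.
{code}
import net.razorvine.pickle.{Pickler, Unpickler}

object PyroliteSketch {
  def main(args: Array[String]): Unit = {
    val pickler = new Pickler()
    val unpickler = new Unpickler()

    // Serialize a JVM object graph into Python pickle format...
    val bytes: Array[Byte] = pickler.dumps(java.util.Arrays.asList(1.0, 2.0, 3.0))

    // ...and deserialize it back; Python's pickle module could read the same bytes.
    val restored = unpickler.loads(bytes)
    println(restored)
  }
}
{code}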
[jira] [Commented] (SPARK-2098) All Spark processes should support spark-defaults.conf, config file
[ https://issues.apache.org/jira/browse/SPARK-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132593#comment-14132593 ] Apache Spark commented on SPARK-2098: - User 'witgo' has created a pull request for this issue: https://github.com/apache/spark/pull/2379 All Spark processes should support spark-defaults.conf, config file --- Key: SPARK-2098 URL: https://issues.apache.org/jira/browse/SPARK-2098 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Marcelo Vanzin Assignee: Guoqiang Li SparkSubmit supports the idea of a config file to set SparkConf configurations. This is handy because you can easily set a site-wide configuration file, and power users can use their own when needed, or resort to JVM properties or other means of overriding configs. It would be nice if all Spark processes (e.g. master / worker / history server) also supported something like this. For daemon processes this is particularly interesting because it makes it easy to decouple starting the daemon (e.g. some /etc/init.d script packaged by some distribution) from configuring that daemon. Right now you have to set environment variables to modify the configuration of those daemons, which is not very friendly to the above scenario. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2926) Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle
[ https://issues.apache.org/jira/browse/SPARK-2926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132598#comment-14132598 ] Saisai Shao commented on SPARK-2926: Ok, I will take a try and let you know then it is ready. Thanks a lot. Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle -- Key: SPARK-2926 URL: https://issues.apache.org/jira/browse/SPARK-2926 Project: Spark Issue Type: Improvement Components: Shuffle Affects Versions: 1.1.0 Reporter: Saisai Shao Attachments: SortBasedShuffleRead.pdf, Spark Shuffle Test Report(contd).pdf, Spark Shuffle Test Report.pdf Currently Spark has already integrated sort-based shuffle write, which greatly improves IO performance and reduces memory consumption when the number of reducers is very large. But on the reducer side, it still adopts the hash-based shuffle reader implementation, which neglects the ordering of map output data in some situations. Here we propose an MR-style sort-merge shuffle reader for sort-based shuffle to further improve its performance. Work-in-progress code and a performance test report will be posted once some unit test bugs are fixed. Any comments would be greatly appreciated. Thanks a lot. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
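To make the proposal concrete, here is a small, self-contained sketch (not Spark's shuffle code) of the reduce-side idea: when each map output segment arrives already sorted by key, the reader can merge the sorted iterators with a priority queue instead of building a hash map.
{code}
import scala.collection.mutable

object MergeSortReaderSketch {
  // Merge several key-sorted iterators into one key-sorted iterator.
  def mergeSorted[K, V](segments: Seq[Iterator[(K, V)]])
                       (implicit ord: Ordering[K]): Iterator[(K, V)] = {
    // Min-heap ordered by each buffered iterator's current head key.
    val heap = mutable.PriorityQueue.empty[BufferedIterator[(K, V)]](
      Ordering.by[BufferedIterator[(K, V)], K](_.head._1).reverse)
    segments.map(_.buffered).filter(_.hasNext).foreach(heap.enqueue(_))

    new Iterator[(K, V)] {
      def hasNext: Boolean = heap.nonEmpty
      def next(): (K, V) = {
        val it = heap.dequeue()
        val kv = it.next()
        if (it.hasNext) heap.enqueue(it)  // re-insert with its new head key
        kv
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val a = Iterator(1 -> "a", 3 -> "c", 5 -> "e")
    val b = Iterator(2 -> "b", 3 -> "x", 6 -> "f")
    println(mergeSorted(Seq(a, b)).toList)  // keys come out in globally sorted order
  }
}
{code}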
[jira] [Comment Edited] (SPARK-2926) Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle
[ https://issues.apache.org/jira/browse/SPARK-2926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132598#comment-14132598 ] Saisai Shao edited comment on SPARK-2926 at 9/13/14 8:09 AM: - Ok, I will take a try and let you know when it is ready. Thanks a lot. was (Author: jerryshao): Ok, I will take a try and let you know then it is ready. Thanks a lot. Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle -- Key: SPARK-2926 URL: https://issues.apache.org/jira/browse/SPARK-2926 Project: Spark Issue Type: Improvement Components: Shuffle Affects Versions: 1.1.0 Reporter: Saisai Shao Attachments: SortBasedShuffleRead.pdf, Spark Shuffle Test Report(contd).pdf, Spark Shuffle Test Report.pdf Currently Spark has already integrated sort-based shuffle write, which greatly improves IO performance and reduces memory consumption when the number of reducers is very large. But on the reducer side, it still adopts the hash-based shuffle reader implementation, which neglects the ordering of map output data in some situations. Here we propose an MR-style sort-merge shuffle reader for sort-based shuffle to further improve its performance. Work-in-progress code and a performance test report will be posted once some unit test bugs are fixed. Any comments would be greatly appreciated. Thanks a lot. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3518) Remove useless statement in JsonProtocol
Kousuke Saruta created SPARK-3518: - Summary: Remove useless statement in JsonProtocol Key: SPARK-3518 URL: https://issues.apache.org/jira/browse/SPARK-3518 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Kousuke Saruta Priority: Minor In org.apache.spark.util.JsonProtocol#taskInfoToJson, a variable named accumUpdateMap is created as follows. {code} val accumUpdateMap = taskInfo.accumulables {code} But accumUpdateMap is never used, and there is a second invocation of taskInfo.accumulables as follows. {code} ("Accumulables" -> JArray(taskInfo.accumulables.map(accumulableInfoToJson).toList)) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
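The change itself is tiny; the following simplified stand-ins (not the real JsonProtocol or TaskInfo code) show the pattern being removed: a value is bound to a name that is never read, while the same expression is evaluated again later.
{code}
// Simplified stand-ins for the real TaskInfo / JSON types.
case class AccumulableInfo(id: Long, name: String, update: String)
case class TaskInfo(accumulables: Seq[AccumulableInfo])

object TaskInfoJsonSketch {
  // Before: the first line binds a value that nothing uses.
  def taskInfoToJsonBefore(taskInfo: TaskInfo): Seq[String] = {
    val accumUpdateMap = taskInfo.accumulables  // dead binding, never read
    taskInfo.accumulables.map(a => s"""{"id":${a.id},"name":"${a.name}"}""")
  }

  // After: drop the dead binding; behavior is unchanged.
  def taskInfoToJsonAfter(taskInfo: TaskInfo): Seq[String] =
    taskInfo.accumulables.map(a => s"""{"id":${a.id},"name":"${a.name}"}""")
}
{code}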
[jira] [Commented] (SPARK-3518) Remove useless statement in JsonProtocol
[ https://issues.apache.org/jira/browse/SPARK-3518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132608#comment-14132608 ] Apache Spark commented on SPARK-3518: - User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/2380 Remove useless statement in JsonProtocol Key: SPARK-3518 URL: https://issues.apache.org/jira/browse/SPARK-3518 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Kousuke Saruta Priority: Minor In org.apache.spark.util.JsonProtocol#taskInfoToJson, a variable named accumUpdateMap is created as follows. {code} val accumUpdateMap = taskInfo.accumulables {code} But accumUpdateMap is never used, and there is a second invocation of taskInfo.accumulables as follows. {code} ("Accumulables" -> JArray(taskInfo.accumulables.map(accumulableInfoToJson).toList)) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2593) Add ability to pass an existing Akka ActorSystem into Spark
[ https://issues.apache.org/jira/browse/SPARK-2593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Helena Edelson updated SPARK-2593: -- Description: As a developer I want to pass an existing ActorSystem into StreamingContext in load-time so that I do not have 2 actor systems running on a node in an Akka application. This would mean having spark's actor system on its own named-dispatchers as well as exposing the new private creation of its own actor system. was: As a developer I want to pass an existing ActorSystem into StreamingContext in load-time so that I do not have 2 actor systems running on a node in an Akka application. This would mean having spark's actor system on its own named-dispatchers as well as exposing the new private creation of its own actor system. I would like to create an Akka Extension that wraps around Spark/Spark Streaming and Cassandra. So the programmatic creation would simply be this for a user val extension = SparkCassandra(system) Add ability to pass an existing Akka ActorSystem into Spark --- Key: SPARK-2593 URL: https://issues.apache.org/jira/browse/SPARK-2593 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Helena Edelson As a developer I want to pass an existing ActorSystem into StreamingContext in load-time so that I do not have 2 actor systems running on a node in an Akka application. This would mean having spark's actor system on its own named-dispatchers as well as exposing the new private creation of its own actor system. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2593) Add ability to pass an existing Akka ActorSystem into Spark
[ https://issues.apache.org/jira/browse/SPARK-2593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132797#comment-14132797 ] Helena Edelson commented on SPARK-2593: --- Here is a good example of just one of the issues: it is difficult to locate a remote Spark actor to publish data to the stream. Here I have to have the streaming actor created and, in preStart, publish a custom message containing `self`, which actors in my ActorSystem can receive in order to get the ActorRef to send to. This is incredibly clunky. I will try to carve out some time to do this PR this week. Add ability to pass an existing Akka ActorSystem into Spark --- Key: SPARK-2593 URL: https://issues.apache.org/jira/browse/SPARK-2593 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Helena Edelson As a developer I want to pass an existing ActorSystem into StreamingContext in load-time so that I do not have 2 actor systems running on a node in an Akka application. This would mean having spark's actor system on its own named-dispatchers as well as exposing the new private creation of its own actor system. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
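For readers unfamiliar with the workaround being described, here is a rough Akka sketch of it (names like RegisterReceiver and coordinatorPath are hypothetical): because the receiver actor lives in Spark's own ActorSystem, it has to announce its ActorRef from preStart so that actors in the application's ActorSystem learn where to send data.
{code}
import akka.actor.{Actor, ActorRef}

// Hypothetical handshake message carrying the receiver's ActorRef.
case class RegisterReceiver(ref: ActorRef)

class StreamReceiverActor(coordinatorPath: String) extends Actor {
  override def preStart(): Unit = {
    // Publish our own ActorRef to a well-known actor in the *other* ActorSystem.
    context.actorSelection(coordinatorPath) ! RegisterReceiver(self)
  }

  def receive: Receive = {
    case _ => // hand the message off to the stream (omitted)
  }
}
{code}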
[jira] [Created] (SPARK-3519) PySpark RDDs are missing the distinct(n) method
Nicholas Chammas created SPARK-3519: --- Summary: PySpark RDDs are missing the distinct(n) method Key: SPARK-3519 URL: https://issues.apache.org/jira/browse/SPARK-3519 Project: Spark Issue Type: Bug Components: PySpark, Spark Core Affects Versions: 1.1.0 Reporter: Nicholas Chammas {{distinct()}} works but {{distinct(N)}} doesn't. {code} >>> sc.parallelize([1,1,2]).distinct() PythonRDD[15] at RDD at PythonRDD.scala:43 >>> sc.parallelize([1,1,2]).distinct(2) Traceback (most recent call last): File "<stdin>", line 1, in <module> TypeError: distinct() takes exactly 1 argument (2 given) {code} The PySpark docs only call out [the {{distinct()}} signature|http://spark.apache.org/docs/1.1.0/api/python/pyspark.rdd.RDD-class.html#distinct], but the programming guide [includes the {{distinct(N)}} signature|http://spark.apache.org/docs/1.1.0/programming-guide.html#transformations] as well. {quote} {noformat} distinct([numTasks])) Return a new dataset that contains the distinct elements of the source dataset. {noformat} {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
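For reference, a minimal sketch of what the missing overload does in the Scala RDD API: de-duplicate while shuffling into the requested number of partitions. This mirrors the usual map/reduceByKey formulation and is an illustration, not PySpark's code.
{code}
import org.apache.spark.{SparkConf, SparkContext}

object DistinctSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("distinct-sketch").setMaster("local[2]"))
    val rdd = sc.parallelize(Seq(1, 1, 2))

    // Equivalent of rdd.distinct(2): de-duplicate and end up with 2 partitions.
    val distinct2 = rdd.map(x => (x, null)).reduceByKey((x, _) => x, 2).map(_._1)

    println(distinct2.collect().toSeq)    // e.g. List(1, 2)
    println(distinct2.partitions.length)  // 2
    sc.stop()
  }
}
{code}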
[jira] [Commented] (SPARK-3519) PySpark RDDs are missing the distinct(n) method
[ https://issues.apache.org/jira/browse/SPARK-3519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132833#comment-14132833 ] Nicholas Chammas commented on SPARK-3519: - [~joshrosen] [~davies]: Here is a ticket for the missing {{distinct(N)}} method. I marked it as a bug since the programming guide says it should exist. PySpark RDDs are missing the distinct(n) method --- Key: SPARK-3519 URL: https://issues.apache.org/jira/browse/SPARK-3519 Project: Spark Issue Type: Bug Components: PySpark, Spark Core Affects Versions: 1.1.0 Reporter: Nicholas Chammas {{distinct()}} works but {{distinct(N)}} doesn't. {code} >>> sc.parallelize([1,1,2]).distinct() PythonRDD[15] at RDD at PythonRDD.scala:43 >>> sc.parallelize([1,1,2]).distinct(2) Traceback (most recent call last): File "<stdin>", line 1, in <module> TypeError: distinct() takes exactly 1 argument (2 given) {code} The PySpark docs only call out [the {{distinct()}} signature|http://spark.apache.org/docs/1.1.0/api/python/pyspark.rdd.RDD-class.html#distinct], but the programming guide [includes the {{distinct(N)}} signature|http://spark.apache.org/docs/1.1.0/programming-guide.html#transformations] as well. {quote} {noformat} distinct([numTasks])) Return a new dataset that contains the distinct elements of the source dataset. {noformat} {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3407) Add Date type support
[ https://issues.apache.org/jira/browse/SPARK-3407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-3407: Target Version/s: 1.2.0 Add Date type support - Key: SPARK-3407 URL: https://issues.apache.org/jira/browse/SPARK-3407 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Hao -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3407) Add Date type support
[ https://issues.apache.org/jira/browse/SPARK-3407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-3407: Assignee: Adrian Wang Add Date type support - Key: SPARK-3407 URL: https://issues.apache.org/jira/browse/SPARK-3407 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Hao Assignee: Adrian Wang -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2562) Add Date datatype support to Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2562. - Resolution: Duplicate Add Date datatype support to Spark SQL -- Key: SPARK-2562 URL: https://issues.apache.org/jira/browse/SPARK-2562 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.1 Reporter: Zongheng Yang Priority: Minor Spark SQL currently supports Timestamp, but not Date. Hive introduced support for Date in [HIVE-4055|https://issues.apache.org/jira/browse/HIVE-4055], where the underlying representation is {{java.sql.Date}}. (Thanks to user Rindra Ramamonjison for reporting this.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2594) Add CACHE TABLE name AS SELECT ...
[ https://issues.apache.org/jira/browse/SPARK-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132876#comment-14132876 ] Apache Spark commented on SPARK-2594: - User 'ravipesala' has created a pull request for this issue: https://github.com/apache/spark/pull/2381 Add CACHE TABLE name AS SELECT ... Key: SPARK-2594 URL: https://issues.apache.org/jira/browse/SPARK-2594 Project: Spark Issue Type: New Feature Components: SQL Reporter: Michael Armbrust Priority: Critical -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-3414) Case insensitivity breaks when unresolved relation contains attributes with uppercase letters in their names
[ https://issues.apache.org/jira/browse/SPARK-3414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust reopened SPARK-3414: - Assignee: Michael Armbrust (was: Cheng Lian) Case insensitivity breaks when unresolved relation contains attributes with uppercase letters in their names Key: SPARK-3414 URL: https://issues.apache.org/jira/browse/SPARK-3414 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.2 Reporter: Cheng Lian Assignee: Michael Armbrust Priority: Critical Fix For: 1.2.0 Paste the following snippet to {{spark-shell}} (need Hive support) to reproduce this issue: {code} import org.apache.spark.sql.hive.HiveContext val hiveContext = new HiveContext(sc) import hiveContext._ case class LogEntry(filename: String, message: String) case class LogFile(name: String) sc.makeRDD(Seq.empty[LogEntry]).registerTempTable(rawLogs) sc.makeRDD(Seq.empty[LogFile]).registerTempTable(logFiles) val srdd = sql( SELECT name, message FROM rawLogs JOIN ( SELECT name FROM logFiles ) files ON rawLogs.filename = files.name ) srdd.registerTempTable(boom) sql(select * from boom) {code} Exception thrown: {code} SchemaRDD[7] at RDD at SchemaRDD.scala:103 == Query Plan == == Physical Plan == org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: *, tree: Project [*] LowerCaseSchema Subquery boom Project ['name,'message] Join Inner, Some(('rawLogs.filename = name#2)) LowerCaseSchema Subquery rawlogs SparkLogicalPlan (ExistingRdd [filename#0,message#1], MapPartitionsRDD[1] at mapPartitions at basicOperators.scala:208) Subquery files Project [name#2] LowerCaseSchema Subquery logfiles SparkLogicalPlan (ExistingRdd [name#2], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:208) {code} Notice that {{rawLogs}} in the join operator is not lowercased. The reason is that, during analysis phase, the {{CaseInsensitiveAttributeReferences}} batch is only executed before the {{Resolution}} batch. And when {{srdd}} is registered as temporary table {{boom}}, its original (unanalyzed) logical plan is stored into the catalog: {code} Join Inner, Some(('rawLogs.filename = 'files.name)) UnresolvedRelation None, rawLogs, None Subquery files Project ['name] UnresolvedRelation None, logFiles, None {code} notice that attributes referenced in the join operator (esp. {{rawLogs}}) is not lowercased yet. And then, when {{select * from boom}} is been analyzed, its input logical plan is: {code} Project [*] UnresolvedRelation None, boom, None {code} here the unresolved relation points to the unanalyzed logical plan of {{srdd}} above, which is later discovered by rule {{ResolveRelations}}, thus not touched by {{CaseInsensitiveAttributeReferences}} at all, and {{rawLogs.filename}} is thus not lowercased: {code} === Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations === Project [*]Project [*] ! UnresolvedRelation None, boom, NoneLowerCaseSchema ! Subquery boom ! Project ['name,'message] ! Join Inner, Some(('rawLogs.filename = 'files.name)) !LowerCaseSchema ! Subquery rawlogs ! SparkLogicalPlan (ExistingRdd [filename#0,message#1], MapPartitionsRDD[1] at mapPartitions at basicOperators.scala:208) !Subquery files ! Project ['name] ! LowerCaseSchema ! Subquery logfiles !SparkLogicalPlan (ExistingRdd [name#2], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:208) {code} A reasonable fix for this could be always register analyzed logical plan to the catalog when registering temporary tables. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
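To make the suggested fix direction concrete, here is a deliberately simplified, hypothetical sketch; the names (Catalog, LogicalPlan, analyze) stand in for Catalyst's real types rather than matching its API. The point is only that the plan stored for a temporary table is the analyzed one, so later lookups see resolved, case-normalized attributes.
{code}
object RegisterAnalyzedSketch {
  trait LogicalPlan  // stand-in for Catalyst's LogicalPlan

  class Catalog(analyze: LogicalPlan => LogicalPlan) {
    private val tables = scala.collection.mutable.Map.empty[String, LogicalPlan]

    def registerTempTable(name: String, plan: LogicalPlan): Unit = {
      // Before: tables(name) = plan   -- the unanalyzed plan leaks into the catalog
      // After: store the analyzed plan, so case-insensitive resolution has already run.
      tables(name) = analyze(plan)
    }

    def lookupRelation(name: String): Option[LogicalPlan] = tables.get(name)
  }
}
{code}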
[jira] [Commented] (SPARK-3414) Case insensitivity breaks when unresolved relation contains attributes with uppercase letters in their names
[ https://issues.apache.org/jira/browse/SPARK-3414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132893#comment-14132893 ] Apache Spark commented on SPARK-3414: - User 'marmbrus' has created a pull request for this issue: https://github.com/apache/spark/pull/2382 Case insensitivity breaks when unresolved relation contains attributes with uppercase letters in their names Key: SPARK-3414 URL: https://issues.apache.org/jira/browse/SPARK-3414 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.2 Reporter: Cheng Lian Assignee: Michael Armbrust Priority: Critical Fix For: 1.2.0 Paste the following snippet to {{spark-shell}} (need Hive support) to reproduce this issue: {code} import org.apache.spark.sql.hive.HiveContext val hiveContext = new HiveContext(sc) import hiveContext._ case class LogEntry(filename: String, message: String) case class LogFile(name: String) sc.makeRDD(Seq.empty[LogEntry]).registerTempTable(rawLogs) sc.makeRDD(Seq.empty[LogFile]).registerTempTable(logFiles) val srdd = sql( SELECT name, message FROM rawLogs JOIN ( SELECT name FROM logFiles ) files ON rawLogs.filename = files.name ) srdd.registerTempTable(boom) sql(select * from boom) {code} Exception thrown: {code} SchemaRDD[7] at RDD at SchemaRDD.scala:103 == Query Plan == == Physical Plan == org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: *, tree: Project [*] LowerCaseSchema Subquery boom Project ['name,'message] Join Inner, Some(('rawLogs.filename = name#2)) LowerCaseSchema Subquery rawlogs SparkLogicalPlan (ExistingRdd [filename#0,message#1], MapPartitionsRDD[1] at mapPartitions at basicOperators.scala:208) Subquery files Project [name#2] LowerCaseSchema Subquery logfiles SparkLogicalPlan (ExistingRdd [name#2], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:208) {code} Notice that {{rawLogs}} in the join operator is not lowercased. The reason is that, during analysis phase, the {{CaseInsensitiveAttributeReferences}} batch is only executed before the {{Resolution}} batch. And when {{srdd}} is registered as temporary table {{boom}}, its original (unanalyzed) logical plan is stored into the catalog: {code} Join Inner, Some(('rawLogs.filename = 'files.name)) UnresolvedRelation None, rawLogs, None Subquery files Project ['name] UnresolvedRelation None, logFiles, None {code} notice that attributes referenced in the join operator (esp. {{rawLogs}}) is not lowercased yet. And then, when {{select * from boom}} is been analyzed, its input logical plan is: {code} Project [*] UnresolvedRelation None, boom, None {code} here the unresolved relation points to the unanalyzed logical plan of {{srdd}} above, which is later discovered by rule {{ResolveRelations}}, thus not touched by {{CaseInsensitiveAttributeReferences}} at all, and {{rawLogs.filename}} is thus not lowercased: {code} === Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations === Project [*]Project [*] ! UnresolvedRelation None, boom, NoneLowerCaseSchema ! Subquery boom ! Project ['name,'message] ! Join Inner, Some(('rawLogs.filename = 'files.name)) !LowerCaseSchema ! Subquery rawlogs ! SparkLogicalPlan (ExistingRdd [filename#0,message#1], MapPartitionsRDD[1] at mapPartitions at basicOperators.scala:208) !Subquery files ! Project ['name] ! LowerCaseSchema ! 
Subquery logfiles !SparkLogicalPlan (ExistingRdd [name#2], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:208) {code} A reasonable fix for this could be always register analyzed logical plan to the catalog when registering temporary tables. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3481) HiveComparisonTest throws exception of org.apache.hadoop.hive.ql.metadata.HiveException: Database does not exist: default
[ https://issues.apache.org/jira/browse/SPARK-3481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-3481. - Resolution: Fixed Fix Version/s: 1.2.0 Assignee: Cheng Hao HiveComparisonTest throws exception of org.apache.hadoop.hive.ql.metadata.HiveException: Database does not exist: default --- Key: SPARK-3481 URL: https://issues.apache.org/jira/browse/SPARK-3481 Project: Spark Issue Type: Test Components: SQL Reporter: Cheng Hao Assignee: Cheng Hao Priority: Minor Fix For: 1.2.0 In local test, lots of exception raised like: {panel} 11:08:01.746 ERROR hive.ql.exec.DDLTask: org.apache.hadoop.hive.ql.metadata.HiveException: Database does not exist: default at org.apache.hadoop.hive.ql.exec.DDLTask.switchDatabase(DDLTask.java:3480) at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:237) at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:151) at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:65) at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1414) at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1192) at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1020) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888) at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:298) at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:272) at org.apache.spark.sql.hive.test.TestHiveContext.runSqlHive(TestHive.scala:88) at org.apache.spark.sql.hive.test.TestHiveContext.reset(TestHive.scala:348) at org.apache.spark.sql.hive.execution.HiveComparisonTest$$anonfun$createQueryTest$1.apply$mcV$sp(HiveComparisonTest.scala:255) at org.apache.spark.sql.hive.execution.HiveComparisonTest$$anonfun$createQueryTest$1.apply(HiveComparisonTest.scala:225) at org.apache.spark.sql.hive.execution.HiveComparisonTest$$anonfun$createQueryTest$1.apply(HiveComparisonTest.scala:225) at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:158) at org.scalatest.Suite$class.withFixture(Suite.scala:1121) at org.scalatest.FunSuite.withFixture(FunSuite.scala:1559) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:155) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:167) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:167) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:167) at org.scalatest.FunSuite.runTest(FunSuite.scala:1559) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:200) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:200) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) at scala.collection.immutable.List.foreach(List.scala:318) at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) at 
org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:200) at org.scalatest.FunSuite.runTests(FunSuite.scala:1559) at org.scalatest.Suite$class.run(Suite.scala:1423) at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1559) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:204) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:204) at org.scalatest.SuperEngine.runImpl(Engine.scala:545) at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:204) at org.apache.spark.sql.hive.execution.HiveComparisonTest.org$scalatest$BeforeAndAfterAll$$super$run(HiveComparisonTest.scala:41) at org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257) at
[jira] [Commented] (SPARK-3519) PySpark RDDs are missing the distinct(n) method
[ https://issues.apache.org/jira/browse/SPARK-3519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132963#comment-14132963 ] Apache Spark commented on SPARK-3519: - User 'mattf' has created a pull request for this issue: https://github.com/apache/spark/pull/2383 PySpark RDDs are missing the distinct(n) method --- Key: SPARK-3519 URL: https://issues.apache.org/jira/browse/SPARK-3519 Project: Spark Issue Type: Bug Components: PySpark, Spark Core Affects Versions: 1.1.0 Reporter: Nicholas Chammas Assignee: Matthew Farrellee {{distinct()}} works but {{distinct(N)}} doesn't. {code} >>> sc.parallelize([1,1,2]).distinct() PythonRDD[15] at RDD at PythonRDD.scala:43 >>> sc.parallelize([1,1,2]).distinct(2) Traceback (most recent call last): File "<stdin>", line 1, in <module> TypeError: distinct() takes exactly 1 argument (2 given) {code} The PySpark docs only call out [the {{distinct()}} signature|http://spark.apache.org/docs/1.1.0/api/python/pyspark.rdd.RDD-class.html#distinct], but the programming guide [includes the {{distinct(N)}} signature|http://spark.apache.org/docs/1.1.0/programming-guide.html#transformations] as well. {quote} {noformat} distinct([numTasks])) Return a new dataset that contains the distinct elements of the source dataset. {noformat} {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3294) Avoid boxing/unboxing when handling in-memory columnar storage
[ https://issues.apache.org/jira/browse/SPARK-3294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-3294. - Resolution: Fixed Fix Version/s: 1.2.0 Avoid boxing/unboxing when handling in-memory columnar storage -- Key: SPARK-3294 URL: https://issues.apache.org/jira/browse/SPARK-3294 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.2, 1.1.0 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Critical Fix For: 1.2.0 When Spark SQL's in-memory columnar storage was implemented, we tried to avoid boxing/unboxing costs as much as possible, but {{javap}} shows that there is still code that involves boxing/unboxing on critical paths due to type erasure, especially in methods of sub-classes of {{ColumnType}}. We should eliminate it whenever possible for better performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
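A small illustration of the erasure problem described above (simplified stand-ins, not Spark SQL's actual ColumnType hierarchy): calls through the generic signature box the Int result, while a primitive-specific method on the concrete column type keeps the hot path allocation-free.
{code}
import java.nio.ByteBuffer

object BoxingSketch {
  abstract class ColumnType[T] {
    def extract(buffer: ByteBuffer): T  // erased to AnyRef, so Int results get boxed
  }

  object IntColumnType extends ColumnType[Int] {
    override def extract(buffer: ByteBuffer): Int = buffer.getInt()

    // Non-generic entry point for callers that statically know the column is Int.
    def extractInt(buffer: ByteBuffer): Int = buffer.getInt()
  }

  def main(args: Array[String]): Unit = {
    val buf = ByteBuffer.allocate(8).putInt(42).putInt(7)
    buf.flip()

    val generic: ColumnType[Int] = IntColumnType
    println(generic.extract(buf))           // goes through the boxing bridge method
    println(IntColumnType.extractInt(buf))  // stays primitive end to end
  }
}
{code}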
[jira] [Resolved] (SPARK-3030) reuse python worker
[ https://issues.apache.org/jira/browse/SPARK-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-3030. --- Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 2259 [https://github.com/apache/spark/pull/2259] reuse python worker --- Key: SPARK-3030 URL: https://issues.apache.org/jira/browse/SPARK-3030 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Davies Liu Assignee: Davies Liu Fix For: 1.2.0 Currently, PySpark forks a Python worker for each task; it would be better if we could reuse the worker for later tasks. This would be very useful for large datasets with big broadcasts, since the broadcast would not need to be sent to the worker again and again. It would also reduce the overhead of launching a task. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
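Conceptually, the change amounts to keeping idle workers around instead of forking a new one per task. The sketch below only illustrates that idea with a hypothetical Worker stand-in; it is not PySpark's implementation.
{code}
import scala.collection.mutable

// Hypothetical stand-in for a forked Python worker process.
class Worker {
  def run(task: () => Unit): Unit = task()
}

class WorkerPool {
  private val idle = mutable.Queue.empty[Worker]

  def borrow(): Worker =
    if (idle.nonEmpty) idle.dequeue()  // reuse: broadcasts already live in the worker
    else new Worker()                  // fork only when no idle worker is available

  def release(worker: Worker): Unit = idle.enqueue(worker)
}
{code}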
[jira] [Updated] (SPARK-3485) should check parameter type when find constructors
[ https://issues.apache.org/jira/browse/SPARK-3485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-3485: Target Version/s: 1.2.0 should check parameter type when find constructors -- Key: SPARK-3485 URL: https://issues.apache.org/jira/browse/SPARK-3485 Project: Spark Issue Type: Improvement Components: SQL Reporter: Adrian Wang In hiveUdfs, we get constructors for primitive types by finding a constructor that takes only one parameter. This is very dangerous when more than one constructor matches. As the sequence of primitiveTypes grows larger, the problem will occur. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
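The following sketch shows the difference the ticket is pointing at, using plain Java reflection rather than the actual hiveUdfs code: matching a constructor only by arity picks whichever single-argument constructor happens to come first, while also checking the parameter type removes the ambiguity.
{code}
import java.lang.reflect.Constructor

object ConstructorLookupSketch {
  // Fragile: any single-argument constructor matches, regardless of its parameter type.
  def byArity(cls: Class[_]): Option[Constructor[_]] =
    cls.getConstructors.find(_.getParameterTypes.length == 1)

  // Safer: the single parameter must also accept the argument's type.
  def byType(cls: Class[_], argType: Class[_]): Option[Constructor[_]] =
    cls.getConstructors.find { c =>
      val params = c.getParameterTypes
      params.length == 1 && params(0).isAssignableFrom(argType)
    }

  def main(args: Array[String]): Unit = {
    // java.lang.Integer has both Integer(int) and Integer(String) constructors,
    // so arity alone is ambiguous while the type check is not.
    println(byArity(classOf[java.lang.Integer]))
    println(byType(classOf[java.lang.Integer], classOf[String]))
  }
}
{code}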
[jira] [Resolved] (SPARK-3515) ParquetMetastoreSuite fails when executed together with other suites under Maven
[ https://issues.apache.org/jira/browse/SPARK-3515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-3515. - Resolution: Fixed Fix Version/s: 1.2.0 Assignee: Cheng Lian ParquetMetastoreSuite fails when executed together with other suites under Maven Key: SPARK-3515 URL: https://issues.apache.org/jira/browse/SPARK-3515 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.1 Reporter: Cheng Lian Assignee: Cheng Lian Fix For: 1.2.0 Reproduction steps: {code} mvn -Phive,hadoop-2.4 -DwildcardSuites=org.apache.spark.sql.parquet.ParquetMetastoreSuite,org.apache.spark.sql.hive.StatisticsSuite -pl core,sql/catalyst,sql/core,sql/hive test {code} Maven instantiates all discovered test suite objects first, and then starts executing all test cases. {{ParquetMetastoreSuite}} sets up several temporary tables in its constructor, but these tables are deleted immediately because {{StatisticsSuite}}'s constructor calls {{TestHiveContext.reset()}}. To fix this issue, we shouldn't put this kind of side effect in the constructor, but in {{beforeAll}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
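A hedged sketch of the fix direction described in the ticket (the helper methods here are hypothetical stand-ins for the real TestHive calls): table setup moves out of the suite's constructor, which Maven runs for every discovered suite up front, into beforeAll, so another suite's reset() can no longer wipe the tables before this suite's tests run.
{code}
import org.scalatest.{BeforeAndAfterAll, FunSuite}

class ParquetMetastoreLikeSuite extends FunSuite with BeforeAndAfterAll {
  // No side-effecting setup in the constructor body.

  override def beforeAll(): Unit = {
    setupTestTables()  // runs immediately before this suite's tests
  }

  override def afterAll(): Unit = {
    dropTestTables()
  }

  test("reads the partitioned test table") {
    assert(queryTestTable().nonEmpty)
  }

  // Hypothetical helpers standing in for the real TestHive setup/teardown.
  private def setupTestTables(): Unit = ()
  private def dropTestTables(): Unit = ()
  private def queryTestTable(): Seq[Int] = Seq(1)
}
{code}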
[jira] [Updated] (SPARK-3501) Hive SimpleUDF will create duplicated type cast which cause exception in constant folding
[ https://issues.apache.org/jira/browse/SPARK-3501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-3501: Target Version/s: 1.2.0 Hive SimpleUDF will create duplicated type cast which cause exception in constant folding - Key: SPARK-3501 URL: https://issues.apache.org/jira/browse/SPARK-3501 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Assignee: Cheng Hao Priority: Minor When do the query like: select datediff(cast(value as timestamp), cast('2002-03-21 00:00:00' as timestamp)) from src; SparkSQL will raise exception: {panel} [info] - Cast Timestamp to Timestamp in UDF *** FAILED *** [info] scala.MatchError: TimestampType (of class org.apache.spark.sql.catalyst.types.TimestampType$) [info] at org.apache.spark.sql.catalyst.expressions.Cast.castToTimestamp(Cast.scala:77) [info] at org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:251) [info] at org.apache.spark.sql.catalyst.expressions.Cast.cast(Cast.scala:247) [info] at org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:263) [info] at org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$5$$anonfun$applyOrElse$2.applyOrElse(Optimizer.scala:217) [info] at org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$5$$anonfun$applyOrElse$2.applyOrElse(Optimizer.scala:210) [info] at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144) [info] at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4$$anonfun$apply$2.apply(TreeNode.scala:180) [info] at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) [info] at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) {panel} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
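To illustrate the shape of the problem, here is a toy expression tree (not Catalyst's classes) where a redundant cast to the same type, the pattern the SimpleUDF wrapper introduces on top of the user's own cast, is collapsed before any folding happens. This shows one way such a duplicated cast can be removed; it is not necessarily the fix the project adopted.
{code}
object CastFoldingSketch {
  sealed trait Expr { def dataType: String }
  case class Literal(value: Any, dataType: String) extends Expr
  case class Cast(child: Expr, dataType: String) extends Expr

  // Collapse casts whose child already has the target type.
  def simplifyCasts(e: Expr): Expr = e match {
    case Cast(child, t) if child.dataType == t => simplifyCasts(child)
    case Cast(child, t)                        => Cast(simplifyCasts(child), t)
    case other                                 => other
  }

  def main(args: Array[String]): Unit = {
    // The UDF wrapper adds a second cast to timestamp on top of the user's own cast.
    val doubled = Cast(Cast(Literal("2002-03-21 00:00:00", "string"), "timestamp"), "timestamp")
    println(simplifyCasts(doubled))  // Cast(Literal(2002-03-21 00:00:00,string),timestamp)
  }
}
{code}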
[jira] [Updated] (SPARK-3501) Hive SimpleUDF will create duplicated type cast which cause exception in constant folding
[ https://issues.apache.org/jira/browse/SPARK-3501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-3501: Assignee: Cheng Hao Hive SimpleUDF will create duplicated type cast which cause exception in constant folding - Key: SPARK-3501 URL: https://issues.apache.org/jira/browse/SPARK-3501 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Assignee: Cheng Hao Priority: Minor When do the query like: select datediff(cast(value as timestamp), cast('2002-03-21 00:00:00' as timestamp)) from src; SparkSQL will raise exception: {panel} [info] - Cast Timestamp to Timestamp in UDF *** FAILED *** [info] scala.MatchError: TimestampType (of class org.apache.spark.sql.catalyst.types.TimestampType$) [info] at org.apache.spark.sql.catalyst.expressions.Cast.castToTimestamp(Cast.scala:77) [info] at org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:251) [info] at org.apache.spark.sql.catalyst.expressions.Cast.cast(Cast.scala:247) [info] at org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:263) [info] at org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$5$$anonfun$applyOrElse$2.applyOrElse(Optimizer.scala:217) [info] at org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$5$$anonfun$applyOrElse$2.applyOrElse(Optimizer.scala:210) [info] at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144) [info] at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4$$anonfun$apply$2.apply(TreeNode.scala:180) [info] at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) [info] at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) {panel} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3438) Support for accessing secured HDFS in Standalone Mode
[ https://issues.apache.org/jira/browse/SPARK-3438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3438: --- Description: Access to secured HDFS is currently supported in YARN using YARN's built-in security mechanism. In YARN mode, a user application is authenticated when it is submitted; it then acquires delegation tokens and ships them (via YARN) securely to workers. In Standalone mode, it would be nice to support a mechanism for accessing HDFS where we rely on a single shared secret to authenticate communication in the standalone cluster. 1. A company is running a standalone cluster. 2. They are fine if all Spark jobs in the cluster share a global secret, i.e. all Spark jobs can trust one another. 3. They are able to provide a Hadoop login on the driver node via a keytab or kinit. They want tokens from this login to be distributed to the executors to allow access to secure HDFS. 4. They also don't want to trust the network on the cluster, i.e. they don't want to allow someone to fetch HDFS tokens easily over a known protocol, without authentication. was:Secured HDFS is supported in YARN currently, but not in standalone mode. The tricky bit is how to disseminate the delegation tokens securely in standalone mode. Support for accessing secured HDFS in Standalone Mode - Key: SPARK-3438 URL: https://issues.apache.org/jira/browse/SPARK-3438 Project: Spark Issue Type: New Feature Components: Deploy, Spark Core Affects Versions: 1.0.2 Reporter: Zhanfeng Huo Access to secured HDFS is currently supported in YARN using YARN's built-in security mechanism. In YARN mode, a user application is authenticated when it is submitted; it then acquires delegation tokens and ships them (via YARN) securely to workers. In Standalone mode, it would be nice to support a mechanism for accessing HDFS where we rely on a single shared secret to authenticate communication in the standalone cluster. 1. A company is running a standalone cluster. 2. They are fine if all Spark jobs in the cluster share a global secret, i.e. all Spark jobs can trust one another. 3. They are able to provide a Hadoop login on the driver node via a keytab or kinit. They want tokens from this login to be distributed to the executors to allow access to secure HDFS. 4. They also don't want to trust the network on the cluster, i.e. they don't want to allow someone to fetch HDFS tokens easily over a known protocol, without authentication. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3463) Show metrics about spilling in Python
[ https://issues.apache.org/jira/browse/SPARK-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-3463. --- Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 2336 [https://github.com/apache/spark/pull/2336] Show metrics about spilling in Python - Key: SPARK-3463 URL: https://issues.apache.org/jira/browse/SPARK-3463 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Davies Liu Assignee: Davies Liu Fix For: 1.2.0 It should also show the number of bytes spilled to disk while doing aggregation in Python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1087) Separate file for traceback and callsite related functions
[ https://issues.apache.org/jira/browse/SPARK-1087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14133094#comment-14133094 ] Apache Spark commented on SPARK-1087: - User 'staple' has created a pull request for this issue: https://github.com/apache/spark/pull/2385 Separate file for traceback and callsite related functions -- Key: SPARK-1087 URL: https://issues.apache.org/jira/browse/SPARK-1087 Project: Spark Issue Type: New Feature Components: PySpark Reporter: Jyotiska NK Right now, _extract_concise_traceback() is written inside rdd.py, which provides the callsite information. But for [SPARK-972](https://spark-project.atlassian.net/browse/SPARK-972) in PR #581, we used the function from context.py. Also, some issues were faced regarding the return string format. It would be a good idea to move the traceback function out of rdd and create a separate file for future development. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org