[GitHub] spark issue #13644: [SPARK-15925][SQL][SPARKR] Replaces registerTempTable wi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13644 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #10953: [SPARK-12177] [STREAMING] Update KafkaDStreams to new Ka...
Github user markgrover commented on the issue: https://github.com/apache/spark/pull/10953 Yeah, I agree with @koeninger. This PR is pretty out of date, it makes sense to turn focus on Cody's PR #11863
[GitHub] spark pull request #13651: [SPARK-15776][SQL] Divide Expression inside Aggre...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/13651#discussion_r66880282

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala ---
@@ -525,7 +525,7 @@ object TypeCoercion {
   def apply(plan: LogicalPlan): LogicalPlan = plan resolveExpressions {
     // Skip nodes who has not been resolved yet,
     // as this is an extra rule which should be applied at last.
-    case e if !e.resolved => e
+    case e if !e.childrenResolved => e
     // Decimal and Double remain the same
--- End diff --

We can simplify this:
```
case e if !e.childrenResolved => e
case d: Divide if d.dataType.isInstanceOf[IntegralType] => ...
```
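The shape of such a rule — walk the expression tree bottom-up, skip nodes whose children are not yet resolved, rewrite the rest — can be sketched in miniature. This is a hedged Python toy, not Catalyst: `Node`, `children_resolved`, and `rewrite` are simplified stand-ins for Catalyst's `resolveExpressions` machinery.

```python
# Miniature model of a bottom-up rewrite rule that skips nodes whose
# children are not yet resolved, in the spirit of the childrenResolved
# guard discussed above. Names are illustrative, not Spark's API.

class Node:
    def __init__(self, name, children=(), resolved=True):
        self.name = name
        self.children = list(children)
        self.resolved = resolved

    @property
    def children_resolved(self):
        # True when every direct child is resolved (trivially true for leaves).
        return all(c.resolved for c in self.children)

def rewrite(node, rule):
    # Transform children first (bottom-up), then apply the rule to the node.
    node.children = [rewrite(c, rule) for c in node.children]
    return rule(node)

def rule(node):
    # Guard: leave a node alone until its children are resolved.
    if not node.children_resolved:
        return node
    if node.name == "Divide":
        return Node("FractionalDivide", node.children)
    return node

ok = Node("Divide", [Node("a"), Node("b")])
pending = Node("Divide", [Node("a", resolved=False), Node("b")])

assert rewrite(ok, rule).name == "FractionalDivide"  # children resolved: rewritten
assert rewrite(pending, rule).name == "Divide"       # unresolved child: skipped
```

The guard matters because type-coercion rules run interleaved with resolution: firing a rewrite on a node whose children still lack types would coerce based on incomplete information.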
[GitHub] spark pull request #13638: [SPARK-15915][SQL] CacheManager should use canoni...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/13638#discussion_r66880127

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala ---
@@ -155,8 +156,9 @@ private[sql] class CacheManager extends Logging {
    * function will over invalidate.
    */
   private[sql] def invalidateCache(plan: LogicalPlan): Unit = writeLock {
+    val canonicalized = plan.canonicalized
     cachedData.foreach {
-      case data if data.plan.collect { case p if p.sameResult(plan) => p }.nonEmpty =>
+      case data if data.plan.collect { case p if p.sameResult(canonicalized) => p }.nonEmpty =>
--- End diff --

I don't think so. For example, if the cached plan is `LocalRelation` (which is canonicalized) and the `plan` argument is `SubqueryAlias(LocalRelation)` (which is not canonicalized), it will fail to find the same-result plan.
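The mismatch under discussion — a cached plan stored in canonical form versus a lookup plan still wrapped in a cosmetic alias — can be illustrated with a toy model. This is a hedged Python sketch under assumed semantics; `LocalRelation`, `SubqueryAlias`, `canonicalize`, and `same_result` here are simplified stand-ins, not Spark's actual classes or methods.

```python
# Toy model of logical-plan canonicalization: canonicalize() strips
# cosmetic wrappers (like a subquery alias) so that two plans producing
# the same result compare equal. Illustrative only, not Spark's API.

class Plan:
    def canonicalize(self):
        return self

class LocalRelation(Plan):
    def __repr__(self):
        return "LocalRelation"

class SubqueryAlias(Plan):
    def __init__(self, child):
        self.child = child

    def canonicalize(self):
        # The alias is cosmetic: canonicalization drops it.
        return self.child.canonicalize()

    def __repr__(self):
        return f"SubqueryAlias({self.child!r})"

def naive_same_result(a, b):
    # Purely structural comparison: a cosmetic wrapper breaks the match.
    return repr(a) == repr(b)

def same_result(a, b):
    # Comparing canonical forms is wrapper-insensitive.
    return repr(a.canonicalize()) == repr(b.canonicalize())

cached = LocalRelation()                 # cached plans are stored canonicalized
lookup = SubqueryAlias(LocalRelation())  # incoming plan still carries the alias

assert not naive_same_result(cached, lookup)  # structural match fails
assert same_result(cached, lookup)            # canonical forms agree
```

In this toy, whether canonicalizing the lookup plan up front is redundant depends entirely on whether the comparison itself canonicalizes both sides — which is exactly the question the reviewers are debating.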
[GitHub] spark pull request #13651: [SPARK-15776][SQL] Divide Expression inside Aggre...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/13651#discussion_r66879827

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala ---
@@ -213,7 +213,7 @@ case class Multiply(left: Expression, right: Expression)
 case class Divide(left: Expression, right: Expression)
   extends BinaryArithmetic with NullIntolerant {
-  override def inputType: AbstractDataType = NumericType
--- End diff --

we should also clean up the `divide` expression to remove code for integral division.
[GitHub] spark pull request #13651: [SPARK-15776][SQL] Divide Expression inside Aggre...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/13651#discussion_r66879738

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala ---
@@ -2847,4 +2847,15 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext {
   test("SPARK-15887: hive-site.xml should be loaded") {
     assert(spark.sessionState.newHadoopConf().get("hive.in.test") == "true")
   }
+
+  test("SPARK-15776 Divide expression inside an Aggregation function should not " +
--- End diff --

I think we need some low-level unit test instead of an end-to-end test
[GitHub] spark issue #13651: [SPARK-15776][SQL] Divide Expression inside Aggregation ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13651 **[Test build #60439 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60439/consoleFull)** for PR 13651 at commit [`df08eea`](https://github.com/apache/spark/commit/df08eeacd85187ca5a71463fc5d25f63426ebe84).
[GitHub] spark pull request #13651: [SPARK-15776][SQL] Divide Expression inside Aggre...
GitHub user clockfly opened a pull request: https://github.com/apache/spark/pull/13651

[SPARK-15776][SQL] Divide Expression inside Aggregation function is casted to wrong type

## What changes were proposed in this pull request?

This PR fixes the problem that a Divide expression inside an aggregation function is cast to the wrong type. After the fix, the behavior is consistent with Hive.

**Before the change:**
```
scala> sql("select sum(1 / 2) as a").schema
res4: org.apache.spark.sql.types.StructType = StructType(StructField(a,LongType,true))

scala> sql("select sum(1 / 2) as a").show()
+---+
| a|
+---+
|0 |
+---+
```

**After the change:**
```
scala> sql("select sum(1 / 2) as a").schema
res4: org.apache.spark.sql.types.StructType = StructType(StructField(a,DoubleType,true))

scala> sql("select sum(1 / 2) as a").show()
+---+
| a|
+---+
|0.5|
+---+
```

## How was this patch tested?

Unit test.

This PR is based on https://github.com/apache/spark/pull/13524 by @Sephiroth-Lin

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/clockfly/spark SPARK-15776

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/13651.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #13651

commit df08eeacd85187ca5a71463fc5d25f63426ebe84
Author: Sean Zhong
Date: 2016-06-13T22:09:20Z

    SPARK-15776 Divide Expression inside an Aggregation function is casted to wrong type
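The coercion being fixed — a divide over two integral operands should produce a fractional result before the aggregate sees it — can be modeled outside Spark as a small rewrite rule. This is a hedged Python sketch; the class and function names are illustrative, and the real rule lives in Catalyst's `TypeCoercion`, not here.

```python
# Toy model of the coercion rule: a Divide whose operands are both
# integral is rewritten so both sides become doubles, making sum(1 / 2)
# yield 0.5 (double) rather than 0 (integral division).
# Illustrative only; not Spark's actual TypeCoercion implementation.

from dataclasses import dataclass

@dataclass
class Lit:
    value: object

    def dtype(self):
        return "int" if isinstance(self.value, int) else "double"

    def eval(self):
        return self.value

@dataclass
class Divide:
    left: object
    right: object

    def dtype(self):
        # Without coercion, an all-integral divide stays integral.
        types = (self.left.dtype(), self.right.dtype())
        return "double" if "double" in types else "int"

    def eval(self):
        if self.dtype() == "int":
            return self.left.eval() // self.right.eval()  # integral division
        return self.left.eval() / self.right.eval()

def coerce(expr):
    # The rule: both operands integral -> replace them with double literals.
    if isinstance(expr, Divide) and expr.left.dtype() == "int" and expr.right.dtype() == "int":
        return Divide(Lit(float(expr.left.eval())), Lit(float(expr.right.eval())))
    return expr

before = Divide(Lit(1), Lit(2))
after = coerce(before)

assert before.eval() == 0          # pre-fix behavior: integral, sum(1/2) == 0
assert after.dtype() == "double"   # post-fix schema: DoubleType
assert after.eval() == 0.5         # post-fix value: sum(1/2) == 0.5
```

The point the PR makes is that this rewrite must fire on the divide itself even when it sits inside an aggregate like `sum(...)`, so the aggregate's result type is derived from the already-coerced child.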
[GitHub] spark issue #13338: [SPARK-13723] [YARN] Change behavior of --num-executors ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13338 Merged build finished. Test FAILed.
[GitHub] spark issue #13338: [SPARK-13723] [YARN] Change behavior of --num-executors ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13338 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60427/ Test FAILed.
[GitHub] spark issue #13338: [SPARK-13723] [YARN] Change behavior of --num-executors ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13338 **[Test build #60427 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60427/consoleFull)** for PR 13338 at commit [`bf22b5a`](https://github.com/apache/spark/commit/bf22b5ab0bc8369949ac33833b078e7e13c7ce35).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #13592: [SPARK-15863][SQL][DOC] Initial SQL programming g...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/13592#discussion_r66878387

--- Diff: docs/sql-programming-guide.md ---
@@ -1650,14 +1646,15 @@
 ## Hive Tables

 Spark SQL also supports reading and writing data stored in [Apache Hive](http://hive.apache.org/).
-However, since Hive has a large number of dependencies, it is not included in the default Spark assembly.
-Hive support is enabled by adding the `-Phive` and `-Phive-thriftserver` flags to Spark's build.
-This command builds a new assembly directory that includes Hive. Note that this Hive assembly directory must also be present
-on all of the worker nodes, as they will need access to the Hive serialization and deserialization libraries
-(SerDes) in order to access data stored in Hive.
+However, since Hive has a large number of dependencies, these dependencies are not included in the
+default Spark distribution. If Hive dependencies can be found on the classpath, Spark will load them
+automatically. Note that these Hive dependencies must also be present on all of the worker nodes, as
+they will need access to the Hive serialization and deserialization libraries (SerDes) in order to
+access data stored in Hive.

-Configuration of Hive is done by placing your `hive-site.xml`, `core-site.xml` (for security configuration),
-`hdfs-site.xml` (for HDFS configuration) file in `conf/`.
+Configuration of Hive is done by placing your `core-site.xml` (for security configuration),
+`hdfs-site.xml` (for HDFS configuration) file in `conf/`, and adding configurations in your
+`hive-site.xml` into `conf/spark-defaults.conf`.
--- End diff --

it will not be true soon, users only need to put `hive-site.xml` in classpath
[GitHub] spark issue #7963: [SPARK-6227] [MLlib] [PySpark] Implement PySpark wrappers...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/7963 Bump?
[GitHub] spark pull request #13592: [SPARK-15863][SQL][DOC] Initial SQL programming g...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/13592#discussion_r66877689

--- Diff: docs/sql-programming-guide.md ---
@@ -604,49 +607,47 @@
 JavaRDD<Person> people = sc.textFile("examples/src/main/resources/people.txt").m
 });

 // Apply a schema to an RDD of JavaBeans and register it as a table.
-DataFrame schemaPeople = sqlContext.createDataFrame(people, Person.class);
+Dataset<Row> schemaPeople = spark.createDataFrame(people, Person.class);
 schemaPeople.createOrReplaceTempView("people");

 // SQL can be run over RDDs that have been registered as tables.
-DataFrame teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
+Dataset<Row> teenagers = spark.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

-// The results of SQL queries are DataFrames and support all the normal RDD operations.
 // The columns of a row in the result can be accessed by ordinal.
-List<String> teenagerNames = teenagers.javaRDD().map(new Function<Row, String>() {
+List<String> teenagerNames = teenagers.map(new MapFunction<Row, String>() {
   public String call(Row row) {
     return "Name: " + row.getString(0);
   }
-}).collect();
+}).collectAsList();
{% endhighlight %}
+
--- End diff --

looks like it's still valid in python
[GitHub] spark pull request #13592: [SPARK-15863][SQL][DOC] Initial SQL programming g...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/13592#discussion_r66877316

--- Diff: docs/sql-programming-guide.md ---
@@ -587,7 +590,7 @@ for the JavaBean.
 {% highlight java %}

 // sc is an existing JavaSparkContext.
-SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc);
+SparkSession spark = new org.apache.spark.sql.SparkSession(sc);

 // Load a text file and convert each line to a JavaBean.
 JavaRDD<Person> people = sc.textFile("examples/src/main/resources/people.txt").map(
--- End diff --

is this example still valid?
[GitHub] spark pull request #13592: [SPARK-15863][SQL][DOC] Initial SQL programming g...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/13592#discussion_r66877088

--- Diff: docs/sql-programming-guide.md ---
@@ -517,24 +517,26 @@
 types such as Sequences or Arrays. This RDD can be implicitly converted to a DataFrame and then be
 registered as a table. Tables can be used in subsequent SQL statements.

 {% highlight scala %}
-// sc is an existing SparkContext.
-val sqlContext = new org.apache.spark.sql.SQLContext(sc)
+val spark: SparkSession // An existing SparkSession

 // this is used to implicitly convert an RDD to a DataFrame.
-import sqlContext.implicits._
+import spark.implicits._

 // Define the schema using a case class.
 // Note: Case classes in Scala 2.10 can support only up to 22 fields. To work around this limit,
 // you can use custom classes that implement the Product interface.
 case class Person(name: String, age: Int)

-// Create an RDD of Person objects and register it as a table.
-val people = sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt)).toDF()
+// Create an RDD of Person objects and register it as a temporary view.
+val people = sc
+  .textFile("examples/src/main/resources/people.txt")
+  .map(_.split(","))
+  .map(p => Person(p(0), p(1).trim.toInt))
+  .toDF()
--- End diff --

There is no reflection anymore, now we always use the type `T` to create encoder and serialize the object.
[GitHub] spark issue #12938: [SPARK-15162][SPARK-15164][PySpark][DOCS][ML] update som...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/12938 What keeps causing this failure? Is it the change in conf.py?
[GitHub] spark issue #13649: [SPARK-15929] Fix portability of DataFrameSuite path glo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13649 **[Test build #60438 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60438/consoleFull)** for PR 13649 at commit [`a466517`](https://github.com/apache/spark/commit/a46651794d701370d673b362019274fe76a2ff29).
[GitHub] spark issue #13498: [SPARK-15011][SQL] Re-enable 'analyze MetastoreRelations...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13498 **[Test build #3095 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3095/consoleFull)** for PR 13498 at commit [`655b0c7`](https://github.com/apache/spark/commit/655b0c73a54cbad3ac3c611a3c869feffbe9a1b5).
[GitHub] spark issue #13498: [SPARK-15011][SQL] Re-enable 'analyze MetastoreRelations...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13498 **[Test build #3093 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3093/consoleFull)** for PR 13498 at commit [`655b0c7`](https://github.com/apache/spark/commit/655b0c73a54cbad3ac3c611a3c869feffbe9a1b5).
[GitHub] spark issue #13498: [SPARK-15011][SQL] Re-enable 'analyze MetastoreRelations...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13498 **[Test build #3094 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3094/consoleFull)** for PR 13498 at commit [`655b0c7`](https://github.com/apache/spark/commit/655b0c73a54cbad3ac3c611a3c869feffbe9a1b5).
[GitHub] spark pull request #13611: [SPARK-15887][SQL] Bring back the hive-site.xml s...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/13611
[GitHub] spark issue #13498: [SPARK-15011][SQL] Re-enable 'analyze MetastoreRelations...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13498 **[Test build #3092 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3092/consoleFull)** for PR 13498 at commit [`655b0c7`](https://github.com/apache/spark/commit/655b0c73a54cbad3ac3c611a3c869feffbe9a1b5).
[GitHub] spark issue #13498: [SPARK-15011][SQL] Re-enable 'analyze MetastoreRelations...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13498 **[Test build #3091 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3091/consoleFull)** for PR 13498 at commit [`655b0c7`](https://github.com/apache/spark/commit/655b0c73a54cbad3ac3c611a3c869feffbe9a1b5).
[GitHub] spark issue #13649: [SPARK-15929] Fix portability of DataFrameSuite path glo...
Github user JoshRosen commented on the issue: https://github.com/apache/spark/pull/13649

Wow, weird test failure:

```
Running Spark unit tests
[info] Running Spark tests using SBT with these arguments: -Pyarn -Phadoop-2.3 -Phive-thriftserver -Phive -Dtest.exclude.tags=org.apache.spark.tags.ExtendedHiveTest,org.apache.spark.tags.ExtendedYarnTest hive-thriftserver/test mllib/test hive/test examples/test sql/test
Using /usr/java/jdk1.8.0_60 as default JAVA_HOME.
Note, this will be overridden by -java-home if it is set.
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512m; support was removed in 8.0
[info] Loading project definition from /home/jenkins/workspace/SparkPullRequestBuilder/project
[CodeBlob (0x7fe7b0214e90)]
Framesize: 2
Runtime Stub (0x7fe7b0214e90): handle_exception_from_callee Runtime1 stub
Could not load hsdis-amd64.so; library not loadable; PrintAssembly is disabled
[CodeBlob (0x7fe7b0214e90)]
Framesize: 2
Runtime Stub (0x7fe7b0214e90): handle_exception_from_callee Runtime1 stub
[thread 140627161380608 also had an error]
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (sharedRuntime.cpp:834), pid=37507, tid=140631972611840
#  fatal error: exception happened outside interpreter, nmethods and vtable stubs at pc 0x7fe7b0214f71
#
# JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build 1.8.0_60-b27)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode linux-amd64 compressed oops)
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /home/jenkins/workspace/SparkPullRequestBuilder/hs_err_pid37507.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
#
/home/jenkins/workspace/SparkPullRequestBuilder/build/sbt-launch-lib.bash: line 72: 37507 Aborted (core dumped) "$@"
```
[GitHub] spark issue #13649: [SPARK-15929] Fix portability of DataFrameSuite path glo...
Github user JoshRosen commented on the issue: https://github.com/apache/spark/pull/13649 Jenkins, retest this please.
[GitHub] spark issue #13498: [SPARK-15011][SQL] Re-enable 'analyze MetastoreRelations...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13498 **[Test build #3090 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3090/consoleFull)** for PR 13498 at commit [`655b0c7`](https://github.com/apache/spark/commit/655b0c73a54cbad3ac3c611a3c869feffbe9a1b5).
[GitHub] spark issue #13498: [SPARK-15011][SQL] Re-enable 'analyze MetastoreRelations...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13498 **[Test build #3089 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3089/consoleFull)** for PR 13498 at commit [`655b0c7`](https://github.com/apache/spark/commit/655b0c73a54cbad3ac3c611a3c869feffbe9a1b5).
[GitHub] spark issue #13498: [SPARK-15011][SQL] Re-enable 'analyze MetastoreRelations...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13498 **[Test build #60437 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60437/consoleFull)** for PR 13498 at commit [`655b0c7`](https://github.com/apache/spark/commit/655b0c73a54cbad3ac3c611a3c869feffbe9a1b5).
[GitHub] spark issue #13611: [SPARK-15887][SQL] Bring back the hive-site.xml support ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13611 **[Test build #60422 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60422/consoleFull)** for PR 13611 at commit [`8b53b22`](https://github.com/apache/spark/commit/8b53b226f0347c545bd13525d6d18bcf6f9a097e).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #13611: [SPARK-15887][SQL] Bring back the hive-site.xml support ...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/13611 Thanks. Merging to master and branch 2.0.
[GitHub] spark issue #13638: [SPARK-15915][SQL] CacheManager should use canonicalized...
Github user marmbrus commented on the issue: https://github.com/apache/spark/pull/13638 Seems reasonable. Is this a regression from 1.6?
[GitHub] spark pull request #13638: [SPARK-15915][SQL] CacheManager should use canoni...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/13638#discussion_r66876008 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala --- @@ -155,8 +156,9 @@ private[sql] class CacheManager extends Logging { * function will over invalidate. */ private[sql] def invalidateCache(plan: LogicalPlan): Unit = writeLock { +val canonicalized = plan.canonicalized cachedData.foreach { - case data if data.plan.collect { case p if p.sameResult(plan) => p }.nonEmpty => + case data if data.plan.collect { case p if p.sameResult(canonicalized) => p }.nonEmpty => --- End diff -- I think this is redundant, `sameResult` already compares the canonicalized plan.
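A toy model can make the review point above concrete. This is an illustrative sketch, not Spark's actual `LogicalPlan` classes: `sameResult` canonicalizes both plans internally, so canonicalizing the argument before calling it changes nothing.

```scala
// Toy model (not Spark's actual classes) of the review point above: sameResult
// already compares *canonicalized* plans, so pre-canonicalizing the argument
// passed to it is redundant.
case class Plan(table: String, alias: Option[String]) {
  // Canonicalization strips cosmetic differences such as aliases.
  def canonicalized: Plan = copy(alias = None)
  // sameResult canonicalizes both sides internally.
  def sameResult(other: Plan): Boolean = canonicalized == other.canonicalized
}
```

With this model, `Plan("t1", Some("x")).sameResult(Plan("t1", Some("y")))` holds whether or not the caller canonicalizes the argument first, which is exactly why the extra `plan.canonicalized` call in the diff is unnecessary.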
[GitHub] spark issue #13649: [SPARK-15929] Fix portability of DataFrameSuite path glo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13649 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60434/ Test FAILed.
[GitHub] spark issue #13649: [SPARK-15929] Fix portability of DataFrameSuite path glo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13649 **[Test build #60434 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60434/consoleFull)** for PR 13649 at commit [`a466517`](https://github.com/apache/spark/commit/a46651794d701370d673b362019274fe76a2ff29).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #13649: [SPARK-15929] Fix portability of DataFrameSuite path glo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13649 Merged build finished. Test FAILed.
[GitHub] spark issue #13611: [SPARK-15887][SQL] Bring back the hive-site.xml support ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13611 Merged build finished. Test PASSed.
[GitHub] spark pull request #13563: [SPARK-15826] [CORE] PipedRDD to allow configurab...
Github user zsxwing commented on a diff in the pull request: https://github.com/apache/spark/pull/13563#discussion_r66875458 --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala --- @@ -734,12 +737,14 @@ abstract class RDD[T: ClassTag]( printPipeContext: (String => Unit) => Unit = null, printRDDElement: (T, String => Unit) => Unit = null, separateWorkingDir: Boolean = false, - bufferSize: Int = 8192): RDD[String] = withScope { + bufferSize: Int = 8192, + encoding: Charset = StandardCharsets.UTF_8): RDD[String] = withScope { --- End diff -- > I will use String instead. Is that fine? 👍
[GitHub] spark issue #13611: [SPARK-15887][SQL] Bring back the hive-site.xml support ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13611 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60422/ Test PASSed.
[GitHub] spark pull request #13592: [SPARK-15863][SQL][DOC] Initial SQL programming g...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/13592#discussion_r66875336 --- Diff: docs/sql-programming-guide.md --- @@ -171,9 +171,9 @@ df.show() {% highlight r %} -sqlContext <- SQLContext(sc) +spark <- SparkSession(sc) --- End diff -- SparkR doesn't have SparkSession
[GitHub] spark pull request #13592: [SPARK-15863][SQL][DOC] Initial SQL programming g...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/13592#discussion_r66875208 --- Diff: docs/sql-programming-guide.md --- @@ -145,10 +145,10 @@ df.show() {% highlight java %} -JavaSparkContext sc = ...; // An existing JavaSparkContext. -SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc); +SparkSession spark = ...; // An existing SparkSession. +SparkSession spark = new org.apache.spark.sql.SparkSession(sc); --- End diff -- hm?
[GitHub] spark issue #13648: [SQL][DOC][minor] document the contract of encoder seria...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/13648 Please create a JIRA ticket.
[GitHub] spark issue #13413: [SPARK-15663][SQL] SparkSession.catalog.listFunctions sh...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13413 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60421/ Test PASSed.
[GitHub] spark issue #13413: [SPARK-15663][SQL] SparkSession.catalog.listFunctions sh...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13413 Merged build finished. Test PASSed.
[GitHub] spark pull request #13592: [SPARK-15863][SQL][DOC] Initial SQL programming g...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/13592#discussion_r66874954 --- Diff: docs/sql-programming-guide.md --- @@ -12,130 +12,130 @@ title: Spark SQL and DataFrames Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided by Spark SQL provide Spark with more information about the structure of both the data and the computation being performed. Internally, Spark SQL uses this extra information to perform extra optimizations. There are several ways to -interact with Spark SQL including SQL, the DataFrames API and the Datasets API. When computing a result +interact with Spark SQL including SQL and the Datasets API. When computing a result the same execution engine is used, independent of which API/language you are using to express the -computation. This unification means that developers can easily switch back and forth between the -various APIs based on which provides the most natural way to express a given transformation. +computation. This unification means that developers can easily switch back and forth between +different APIs based on which provides the most natural way to express a given transformation. All of the examples on this page use sample data included in the Spark distribution and can be run in the `spark-shell`, `pyspark` shell, or `sparkR` shell. ## SQL -One use of Spark SQL is to execute SQL queries written using either a basic SQL syntax or HiveQL. +One use of Spark SQL is to execute SQL queries. Spark SQL can also be used to read data from an existing Hive installation. For more on how to configure this feature, please refer to the [Hive Tables](#hive-tables) section. When running -SQL from within another programming language the results will be returned as a [DataFrame](#DataFrames). +SQL from within another programming language the results will be returned as a [Dataset\[Row\]](#datasets). 
You can also interact with the SQL interface using the [command-line](#running-the-spark-sql-cli) or over [JDBC/ODBC](#running-the-thrift-jdbcodbc-server). -## DataFrames +## Datasets and DataFrames -A DataFrame is a distributed collection of data organized into named columns. It is conceptually -equivalent to a table in a relational database or a data frame in R/Python, but with richer -optimizations under the hood. DataFrames can be constructed from a wide array of [sources](#data-sources) such -as: structured data files, tables in Hive, external databases, or existing RDDs. +A Dataset is a new interface added in Spark 1.6 that tries to provide the benefits of RDDs (strong +typing, ability to use powerful lambda functions) with the benefits of Spark SQL's optimized +execution engine. A Dataset can be [constructed](#creating-datasets) from JVM objects and then +manipulated using functional transformations (map, flatMap, filter, etc.). -The DataFrame API is available in [Scala](api/scala/index.html#org.apache.spark.sql.DataFrame), -[Java](api/java/index.html?org/apache/spark/sql/DataFrame.html), -[Python](api/python/pyspark.sql.html#pyspark.sql.DataFrame), and [R](api/R/index.html). +The Dataset API is the successor of the DataFrame API, which was introduced in Spark 1.3. In Spark +2.0, Datasets and DataFrames are unified, and DataFrames are now equivalent to Datasets of `Row`s. +In fact, `DataFrame` is simply a type alias of `Dataset[Row]` in [the Scala API][scala-datasets]. +However, [Java API][java-datasets] users must use `Dataset` instead. -## Datasets +[scala-datasets]: api/scala/index.html#org.apache.spark.sql.Dataset +[java-datasets]: api/java/index.html?org/apache/spark/sql/Dataset.html -A Dataset is a new experimental interface added in Spark 1.6 that tries to provide the benefits of -RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL's -optimized execution engine. 
A Dataset can be [constructed](#creating-datasets) from JVM objects and then manipulated -using functional transformations (map, flatMap, filter, etc.). +Python does not have support for the Dataset API, but due to its dynamic nature many of the +benefits are already available (i.e. you can access the field of a row by name naturally +`row.columnName`). The case for R is similar. -The unified Dataset API can be used both in [Scala](api/scala/index.html#org.apache.spark.sql.Dataset) and -[Java](api/java/index.html?org/apache/spark/sql/Dataset.html). Python does not yet have support for -the Dataset API, but due to its dynamic nature many of the benefits are already available (i.e. you can -access the field of a row by name naturally `row.columnName`).
[GitHub] spark issue #13413: [SPARK-15663][SQL] SparkSession.catalog.listFunctions sh...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13413 **[Test build #60421 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60421/consoleFull)** for PR 13413 at commit [`75665be`](https://github.com/apache/spark/commit/75665beb74f9a16979dad9161206b863573021b1).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #13650: [SPARK-9623] [ML] Provide variance for RandomForestRegre...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13650 **[Test build #60436 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60436/consoleFull)** for PR 13650 at commit [`0e4e82f`](https://github.com/apache/spark/commit/0e4e82fb778e94aa4641b63e09d848a0362e5939).
[GitHub] spark issue #13650: [SPARK-9623] [ML] Provide variance for RandomForestRegre...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/13650 cc: @yanboliang @MLnick
[GitHub] spark pull request #13592: [SPARK-15863][SQL][DOC] Initial SQL programming g...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/13592#discussion_r66873712 --- Diff: docs/sql-programming-guide.md --- @@ -12,130 +12,130 @@ title: Spark SQL and DataFrames Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided by Spark SQL provide Spark with more information about the structure of both the data and the computation being performed. Internally, Spark SQL uses this extra information to perform extra optimizations. There are several ways to -interact with Spark SQL including SQL, the DataFrames API and the Datasets API. When computing a result +interact with Spark SQL including SQL and the Datasets API. When computing a result the same execution engine is used, independent of which API/language you are using to express the -computation. This unification means that developers can easily switch back and forth between the -various APIs based on which provides the most natural way to express a given transformation. +computation. This unification means that developers can easily switch back and forth between +different APIs based on which provides the most natural way to express a given transformation. All of the examples on this page use sample data included in the Spark distribution and can be run in the `spark-shell`, `pyspark` shell, or `sparkR` shell. ## SQL -One use of Spark SQL is to execute SQL queries written using either a basic SQL syntax or HiveQL. --- End diff -- why change this line?
[GitHub] spark pull request #13563: [SPARK-15826] [CORE] PipedRDD to allow configurab...
Github user tejasapatil commented on a diff in the pull request: https://github.com/apache/spark/pull/13563#discussion_r66873430 --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala --- @@ -734,12 +737,14 @@ abstract class RDD[T: ClassTag]( printPipeContext: (String => Unit) => Unit = null, printRDDElement: (T, String => Unit) => Unit = null, separateWorkingDir: Boolean = false, - bufferSize: Int = 8192): RDD[String] = withScope { + bufferSize: Int = 8192, + encoding: Charset = StandardCharsets.UTF_8): RDD[String] = withScope { --- End diff -- @zsxwing >> Use Codec for Scala API. `Codec` is Scala-specific. I intentionally did not use that because I wanted the Java and Scala APIs to accept the same param types. Anyway, based on the test case failures, neither Charset nor Codec would work because they need to be serializable (I see `org.apache.spark.SparkException: Task not serializable` while running this change). I will use `String` instead. Is that fine? >> I suggest using Codec.defaultCharsetCodec as the default value Thanks for catching that. It's unfortunate to not have UTF-8 as the default, but backward compatibility is far more important.
```
Caused by: java.io.NotSerializableException: sun.nio.cs.UTF_8
Serialization stack:
	- object not serializable (class: sun.nio.cs.UTF_8, value: UTF-8)
	- field (class: org.apache.spark.rdd.PipedRDD, name: org$apache$spark$rdd$PipedRDD$$encoding, type: class java.nio.charset.Charset)
	- object (class org.apache.spark.rdd.PipedRDD, PipedRDD[1] at pipe at :29)
	- field (class: org.apache.spark.rdd.RDD$$anonfun$collect$1, name: $outer, type: class org.apache.spark.rdd.RDD)
	- object (class org.apache.spark.rdd.RDD$$anonfun$collect$1, )
	- field (class: org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12, name: $outer, type: class org.apache.spark.rdd.RDD$$anonfun$collect$1)
	- object (class org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12, )
	at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
	at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
	at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
	at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301)
	... 56 more
```
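The workaround described above can be sketched outside Spark. This is an illustrative example (not Spark's actual `PipedRDD` code): since `Charset` instances such as `sun.nio.cs.UTF_8` are not `Serializable`, a task closure cannot capture one directly; shipping the charset *name* as a `String` and resolving the `Charset` lazily on the worker side avoids the `NotSerializableException`.

```scala
import java.nio.charset.Charset

// Sketch (illustrative, not Spark's PipedRDD): hold the encoding as a plain
// String, which is Serializable, and resolve the Charset only when decoding.
class LineDecoder(encodingName: String) extends Serializable {
  // Resolved lazily after deserialization; the Charset object never travels.
  @transient private lazy val charset: Charset = Charset.forName(encodingName)
  def decode(bytes: Array[Byte]): String = new String(bytes, charset)
}
```

The `@transient lazy val` pattern is the standard Scala idiom here: the field is excluded from serialization and rebuilt from the `String` on first use on the receiving side.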
[GitHub] spark issue #13650: [SPARK-9623] [ML] Provide variance for RandomForestRegre...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13650 **[Test build #60435 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60435/consoleFull)** for PR 13650 at commit [`75254c9`](https://github.com/apache/spark/commit/75254c91cf8d9c2f3638a3f9b1cfd5c029e10996).
[GitHub] spark issue #13482: [SPARK-15725][YARN] Ensure ApplicationMaster sleeps for ...
Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/13482 Ok, I'm fine with this as a workaround for now since you don't really know and this will ensure it, but please clean up the code so that it's clear which sleep is which, and add a nice comment stating why we are doing this. Then I think we should file another JIRA to investigate a more proper fix for this. We shouldn't have to wait for a reason to schedule.
[GitHub] spark pull request #13650: [SPARK-9623] [ML] Provide variance for RandomFore...
GitHub user MechCoder opened a pull request: https://github.com/apache/spark/pull/13650 [SPARK-9623] [ML] Provide variance for RandomForestRegressor predictions ## What changes were proposed in this pull request? It is useful to get the variance of predictions from the `RandomForestRegressor` to plot confidence intervals on the predictions. I verified the formula from page 17 of this paper (http://arxiv.org/pdf/1211.0906v2.pdf) ## How was this patch tested? I added a couple of tests to the RandomForestRegression test suite. You can merge this pull request into a Git repository by running: $ git pull https://github.com/MechCoder/spark random_forest_var Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13649.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #13650 commit 75254c91cf8d9c2f3638a3f9b1cfd5c029e10996 Author: MechCoder Date: 2016-06-09T18:22:53Z [SPARK-9623] [ML] Provide variance for RandomForestRegressor predictions
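The idea behind the PR above can be sketched in a few lines. This is a hedged illustration, not Spark's actual API or the exact formula from the cited paper: a forest regressor predicts the mean of the individual tree predictions, and the spread of those same predictions gives a simple per-example uncertainty estimate usable for confidence intervals.

```scala
// Illustrative sketch (names are hypothetical, not Spark's ML API): given the
// predictions of each tree for one example, return the ensemble prediction
// (the mean) together with the variance of the tree predictions around it.
def predictWithVariance(treePredictions: Seq[Double]): (Double, Double) = {
  val n = treePredictions.size.toDouble
  val mean = treePredictions.sum / n
  // Population variance of the per-tree predictions around the ensemble mean.
  val variance = treePredictions.map(p => (p - mean) * (p - mean)).sum / n
  (mean, variance)
}
```

When all trees agree the variance is zero; the more the trees disagree on an example, the wider the resulting confidence interval would be.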
[GitHub] spark pull request #13592: [SPARK-15863][SQL][DOC] Initial SQL programming g...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/13592#discussion_r66872913 --- Diff: docs/sql-programming-guide.md --- @@ -12,130 +12,130 @@ title: Spark SQL and DataFrames Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided by Spark SQL provide Spark with more information about the structure of both the data and the computation being performed. Internally, Spark SQL uses this extra information to perform extra optimizations. There are several ways to -interact with Spark SQL including SQL, the DataFrames API and the Datasets API. When computing a result +interact with Spark SQL including SQL and the Datasets API. When computing a result --- End diff -- how about `, DataFrame API (Python/R) and Dataset API (Scala/Java)`
[GitHub] spark issue #13649: [SPARK-15929] Fix portability of DataFrameSuite path glo...
Github user JoshRosen commented on the issue: https://github.com/apache/spark/pull/13649 /cc @liancheng for review (since you reviewed the original tests in #11775).
[GitHub] spark issue #13649: [SPARK-15929] Fix portability of DataFrameSuite path glo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13649 **[Test build #60434 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60434/consoleFull)** for PR 13649 at commit [`a466517`](https://github.com/apache/spark/commit/a46651794d701370d673b362019274fe76a2ff29).
[GitHub] spark issue #13623: [SPARK-15895][SQL] Filters out metadata files while doin...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/13623 @rxin Thanks. Consolidated all the underscore- and dot-files filtering logic.
[GitHub] spark pull request #13649: [SPARK-15929] Fix portability of DataFrameSuite p...
GitHub user JoshRosen opened a pull request: https://github.com/apache/spark/pull/13649 [SPARK-15929] Fix portability of DataFrameSuite path globbing tests The DataFrameSuite regression tests for SPARK-13774 fail in my environment because they attempt to glob over all of `/mnt`, and some of the subdirectories have restrictive permissions which cause the test to fail. This patch rewrites those tests to remove all environment-specific assumptions; the tests now create their own unique temporary paths for use in the tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/JoshRosen/spark SPARK-15929 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13649.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #13649 commit a46651794d701370d673b362019274fe76a2ff29 Author: Josh Rosen Date: 2016-06-08T19:43:37Z Clean up environment assumptions in test.
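The fix described above follows a standard testing pattern that can be sketched generically. This is an illustrative helper (not the actual DataFrameSuite code): instead of globbing over a fixed system path such as `/mnt`, whose contents and permissions vary across machines, each test creates its own unique temporary directory and operates only within it.

```scala
import java.nio.file.Files

// Sketch of the portability pattern: run a test body against a fresh, unique
// temp directory, then clean it up, so the test never depends on the contents
// or permissions of a shared system path.
def withUniqueTempDir[T](body: java.io.File => T): T = {
  val dir = Files.createTempDirectory("dataframe-suite-").toFile
  try body(dir)
  finally {
    // shallow cleanup is enough for this sketch (no nested directories)
    Option(dir.listFiles()).foreach(_.foreach(_.delete()))
    dir.delete()
  }
}
```

A test would then write known files into the directory it receives and glob over that directory alone, making its assumptions fully self-contained.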
[GitHub] spark issue #12951: [SPARK-15176][Core] Add maxShares setting to Pools
Github user njwhite commented on the issue: https://github.com/apache/spark/pull/12951 @squito is this OK?
[GitHub] spark issue #13603: [SPARK-15865][CORE] Blacklist should not result in job h...
Github user kayousterhout commented on the issue: https://github.com/apache/spark/pull/13603 Ohh good point that makes sense re: lost executors. Given that, I agree that this approach seems like the right one.
[GitHub] spark pull request #13603: [SPARK-15865][CORE] Blacklist should not result i...
Github user kayousterhout commented on a diff in the pull request: https://github.com/apache/spark/pull/13603#discussion_r66870624 --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala --- @@ -280,10 +280,54 @@ private[spark] class TaskSchedulerImpl( } } } +if (!launchedTask && isTaskSetCompletelyBlacklisted(taskSet)) { + taskSet.abort(s"Aborting ${taskSet.taskSet} because it has a task which cannot be scheduled" + +s" on any executor due to blacklists.") +} return launchedTask } /** + * Check whether the given task set has been blacklisted to the point that it can't run anywhere. + * + * It is possible that this taskset has become impossible to schedule *anywhere* due to the + * blacklist. The most common scenario would be if there are fewer executors than + * spark.task.maxFailures. We need to detect this so we can fail the task set, otherwise the job + * will hang. + * + * The check here is a balance between being sure to catch the issue, but not wasting + * too much time inside the scheduling loop. Just check if the last task is schedulable + * on any of the available executors. So this is O(numExecutors) worst-case, but it'll + * really be fast unless you've got a bunch of things blacklisted. Its possible it won't detect + * the unschedulable task immediately, but if it returns false, there is at least *some* task + * that is schedulable, and after scheduling all of those, we'll eventually find the unschedulable + * task. + */ + private[scheduler] def isTaskSetCompletelyBlacklisted( --- End diff -- I think it would be cleaner to add this method to the TaskSetManager class (and then you don't need the pollPendingTask method) -- and then just pass in the executorsByHost map. That also makes things a little easier to change in the future, if there gets to be some easier way of checking if a particular task set is completely blacklisted. 
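The refactoring suggested in the review above might look roughly like the following. This is a hypothetical sketch only: `TaskSetManagerSketch`, `pendingTaskIndices`, and `isBlacklisted` are illustrative stand-ins, not Spark's actual `TaskSetManager` internals.

```scala
// Hypothetical sketch of hosting the blacklist check on the task set manager
// and passing in the executors-by-host map, as the reviewer suggests.
// All names here are illustrative, not Spark's real API.
class TaskSetManagerSketch(
    pendingTaskIndices: Seq[Int],
    isBlacklisted: (Int, String) => Boolean) {

  /** True when the last pending task cannot be scheduled on any known executor. */
  def isCompletelyBlacklisted(executorsByHost: Map[String, Set[String]]): Boolean =
    pendingTaskIndices.lastOption.exists { task =>
      executorsByHost.values.flatten.forall(exec => isBlacklisted(task, exec))
    }
}
```

As in the quoted doc comment, checking only the last pending task keeps the cost at O(numExecutors) per scheduling round while still guaranteeing the unschedulable task is eventually detected.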
[GitHub] spark issue #13637: [SPARK-15914][SQL] Add deprecated method back to SQLCont...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13637 **[Test build #60433 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60433/consoleFull)** for PR 13637 at commit [`04ef1b5`](https://github.com/apache/spark/commit/04ef1b557cf4267e85c98993c11e7f6a6a31b6c8).
[GitHub] spark issue #13648: [SQL][DOC][minor] document the contract of encoder seria...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/13648 LGTM
[GitHub] spark issue #13593: [SPARK-15864] [SQL] Fix Inconsistent Behaviors when Unca...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/13593 hi @gatorsmile , do you wanna update it? now both `tryUncacheQuery` and `uncacheQuery` won't unregister accumulator
[GitHub] spark issue #13636: [SPARK-15637][SPARKR] Remove R version check since maske...
Github user JoshRosen commented on the issue: https://github.com/apache/spark/pull/13636 For reference, I was using ``` R version 3.3.0 (2016-05-03) -- "Supposedly Educational" Copyright (C) 2016 The R Foundation for Statistical Computing Platform: x86_64-pc-linux-gnu (64-bit) ```
[GitHub] spark issue #13646: [SPARK-15927] Eliminate redundant DAGScheduler code.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13646 **[Test build #60432 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60432/consoleFull)** for PR 13646 at commit [`3e47166`](https://github.com/apache/spark/commit/3e471665505ba0b259fcd7b4a69d2c4ae1f5).
[GitHub] spark issue #13648: [SQL][DOC][minor] document the contract of encoder seria...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13648 **[Test build #60431 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60431/consoleFull)** for PR 13648 at commit [`cdda303`](https://github.com/apache/spark/commit/cdda303ed624aaf3389fb190ff8c473f06afa681).
[GitHub] spark issue #13645: [HOTFIX] Revert "[MINOR][SQL] Standardize 'continuous qu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13645 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60416/ Test PASSed.
[GitHub] spark issue #13645: [HOTFIX] Revert "[MINOR][SQL] Standardize 'continuous qu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13645 Merged build finished. Test PASSed.
[GitHub] spark issue #13648: [SQL][DOC][minor] document the contract of encoder seria...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/13648 cc @hvanhovell @liancheng @clockfly
[GitHub] spark pull request #13648: [SQL][DOC][minor] document the contract of encode...
GitHub user cloud-fan opened a pull request: https://github.com/apache/spark/pull/13648 [SQL][DOC][minor] document the contract of encoder serializer expressions ## What changes were proposed in this pull request? In our encoder framework, we imply that serializer expressions should use `BoundReference` to refer to the input object, and a lot of code depends on this contract (e.g. `ExpressionEncoder.tuple`). This PR adds some documentation and assertions in `ExpressionEncoder` to make it clearer. ## How was this patch tested? existing tests You can merge this pull request into a Git repository by running: $ git pull https://github.com/cloud-fan/spark comment Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13648.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #13648 commit cdda303ed624aaf3389fb190ff8c473f06afa681 Author: Wenchen Fan Date: 2016-06-13T20:26:01Z document the contract of encoder serializer expressions
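The contract this PR documents can be illustrated with a toy model. The types below are deliberately simplified stand-ins for Catalyst's expression classes (`BoundReference`, field extraction), not the real API.

```scala
// Toy model of the contract: a serializer expression should reach the input
// object only through a bound reference. BoundRef/Field/Literal are
// simplified stand-ins for Catalyst expressions, not Spark's actual classes.
sealed trait Expr
case class BoundRef(ordinal: Int) extends Expr
case class Field(child: Expr, name: String) extends Expr
case class Literal(value: Any) extends Expr

// Walk the expression and check that the input is referenced only via BoundRef.
def referencesInputViaBoundRef(e: Expr): Boolean = e match {
  case BoundRef(_)     => true
  case Field(child, _) => referencesInputViaBoundRef(child)
  case Literal(_)      => false
}

// An encoder framework could assert this over each serializer expression,
// which is roughly the kind of check the PR adds to ExpressionEncoder:
assert(referencesInputViaBoundRef(Field(BoundRef(0), "name")))
```

Making the contract explicit with an assertion means a violation fails fast at encoder construction instead of surfacing later as a subtle analysis error.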
[GitHub] spark issue #13645: [HOTFIX] Revert "[MINOR][SQL] Standardize 'continuous qu...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13645 **[Test build #60416 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60416/consoleFull)** for PR 13645 at commit [`2199031`](https://github.com/apache/spark/commit/21990313db506ac13eb7a29f3dd9f2022712cafd). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #13636: [SPARK-15637][SPARKR] Remove R version check since maske...
Github user JoshRosen commented on the issue: https://github.com/apache/spark/pull/13636 I'm still seeing errors after this change in my environment: ``` Failed - 1. Failure: Check masked functions (@test_context.R#31) length(maskedBySparkR) not equal to length(namesOfMasked). 1/1 mismatches [1] 22 - 20 == 2 2. Failure: Check masked functions (@test_context.R#32) sort(maskedBySparkR) not equal to sort(namesOfMasked). Lengths differ: 22 vs 20 DONE === ```
[GitHub] spark issue #13524: [SPARK-15776][SQL] Type coercion incorrect
Github user clockfly commented on the issue: https://github.com/apache/spark/pull/13524 @Sephiroth-Lin I think you can use a simpler case in the description of this PR. Such as: ``` select sum(4/3) ``` The expected result is: ``` 1.3.. ``` The actual result is: ``` 1 ```
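The root cause of the behaviour in that example can be seen with plain Scala arithmetic; this is only an illustration of integral versus coerced division, not Spark's type-coercion code itself.

```scala
// Plain Scala illustration of the bug's root cause: with integral operands
// the division truncates before aggregation, so summing 4/3 accumulates
// ones instead of 1.333...; coercing to Double first keeps the fraction.
val integralDivision = 4 / 3       // Int division truncates to 1
val coercedDivision  = 4.0 / 3.0   // Double division keeps 1.333...

println(integralDivision)  // 1
println(coercedDivision)
```

The TypeCoercion fix discussed in this thread effectively ensures the divide is rewritten to operate on a fractional type before the aggregate sees it.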
[GitHub] spark issue #13623: [SPARK-15895][SQL] Filters out metadata files while doin...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13623 **[Test build #60430 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60430/consoleFull)** for PR 13623 at commit [`ee438e4`](https://github.com/apache/spark/commit/ee438e466e9b5368f821e5cac580393ecf8921ef).
[GitHub] spark issue #13539: [SPARK-15795] [SQL] Enable more optimizations in whole s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13539 Merged build finished. Test PASSed.
[GitHub] spark issue #13539: [SPARK-15795] [SQL] Enable more optimizations in whole s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13539 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60412/ Test PASSed.
[GitHub] spark issue #13645: [HOTFIX] Revert "[MINOR][SQL] Standardize 'continuous qu...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13645 **[Test build #3081 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3081/consoleFull)** for PR 13645 at commit [`2199031`](https://github.com/apache/spark/commit/21990313db506ac13eb7a29f3dd9f2022712cafd). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #13539: [SPARK-15795] [SQL] Enable more optimizations in whole s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13539 **[Test build #60412 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60412/consoleFull)** for PR 13539 at commit [`186283e`](https://github.com/apache/spark/commit/186283e9321120b9a8def7a3ba51ecf5c423e049). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #13646: [SPARK-15927] Eliminate redundant DAGScheduler code.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13646 **[Test build #60419 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60419/consoleFull)** for PR 13646 at commit [`42a8d16`](https://github.com/apache/spark/commit/42a8d16ed0b7e8175a58d1d6fa21685cc36c85c2). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #13646: [SPARK-15927] Eliminate redundant DAGScheduler code.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13646 Merged build finished. Test FAILed.
[GitHub] spark issue #13646: [SPARK-15927] Eliminate redundant DAGScheduler code.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13646 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60419/ Test FAILed.
[GitHub] spark issue #13646: [SPARK-15927] Eliminate redundant DAGScheduler code.
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/13646 LGTM
[GitHub] spark issue #13140: [SPARK-15230] [SQL] distinct() does not handle column na...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13140 **[Test build #60429 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60429/consoleFull)** for PR 13140 at commit [`2f7ffbd`](https://github.com/apache/spark/commit/2f7ffbd58a3437898f32e7603ca6b603f5fd5088).
[GitHub] spark issue #13482: [SPARK-15725][YARN] Ensure ApplicationMaster sleeps for ...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/13482 @tgravescs, removing `notifyAll` doesn't solve the problem entirely, it just removes one path that's causing the `allocate` call to be run too many times. (Also, I haven't tested delaying loss reasons in our Spark jobs at scale, other than for the 200ms introduced here.) Ensuring that `allocate` is not called too often addresses the problem no matter what the immediate cause is. That's why I think it's a good idea to fix the two separately: first, ensure that `allocate` will not run too often and starve other operations on the `YarnAllocator`, and second, track down the cases that cause this. Even if we were to fix the `YarnAllocator` so we don't have resource contention, ensuring a minimum interval between calls to `allocate` is a good idea so that Spark doesn't make too many useless calls to the resource manager. And I don't want to track down this same bug in 3 months because a different path triggers it.
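The minimum-interval guard described in that comment could be sketched as follows. `RateLimitedAllocator`, `maybeAllocate`, and the parameter names are hypothetical, not the actual `YarnAllocator` API.

```scala
// Hypothetical sketch of enforcing a minimum interval between allocate calls,
// so a burst of wake-ups cannot starve other operations on the allocator or
// flood the resource manager. All names are illustrative, not Spark's API.
class RateLimitedAllocator(minIntervalMs: Long) {
  private var lastAllocateMs = 0L

  def maybeAllocate(doAllocate: () => Unit): Unit = synchronized {
    val now = System.currentTimeMillis()
    if (now - lastAllocateMs >= minIntervalMs) {
      lastAllocateMs = now
      doAllocate() // contact the resource manager at most once per interval
    }
    // Otherwise skip: callers that fire too soon are simply coalesced into
    // the next permitted allocation, regardless of which path woke them up.
  }
}
```

This matches the comment's reasoning: the guard is cause-agnostic, so any future code path that triggers excessive wake-ups is contained automatically.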
[GitHub] spark pull request #13613: [SPARK-15889][SQL][STREAMING] Add a unique id to ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/13613
[GitHub] spark pull request #13444: [SPARK-15530][SQL] Set #parallelism for file list...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/13444
[GitHub] spark issue #13613: [SPARK-15889][SQL][STREAMING] Add a unique id to Continu...
Github user zsxwing commented on the issue: https://github.com/apache/spark/pull/13613 Thanks. Merging to master and 2.0.
[GitHub] spark issue #13644: [SPARK-15925][SQL][SPARKR] Replaces registerTempTable wi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13644 **[Test build #60428 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60428/consoleFull)** for PR 13644 at commit [`56a3b9e`](https://github.com/apache/spark/commit/56a3b9e17659f0ea391e6627e4e2136397af4447).
[GitHub] spark issue #13513: [SPARK-15698][SQL][Streaming] Add the ability to remove ...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/13513 @tdas @zsxwing , what is your comment about this PR? Thanks a lot.
[GitHub] spark pull request #13221: [SPARK-15443][SQL][Streaming] Properly explain co...
Github user jerryshao closed the pull request at: https://github.com/apache/spark/pull/13221
[GitHub] spark issue #13221: [SPARK-15443][SQL][Streaming] Properly explain continuou...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/13221 I'm going to close this until I have a thorough fix for this issue. Thanks a lot for your comments.
[GitHub] spark issue #13444: [SPARK-15530][SQL] Set #parallelism for file listing in ...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/13444 Thanks! Merging to master and branch 2.0.
[GitHub] spark issue #13613: [SPARK-15889][SQL][STREAMING] Add a unique id to Continu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13613 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60413/ Test PASSed.
[GitHub] spark issue #13613: [SPARK-15889][SQL][STREAMING] Add a unique id to Continu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13613 Merged build finished. Test PASSed.
[GitHub] spark issue #13137: [SPARK-15247][SQL] Set the default number of partitions ...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/13137 @maropu Just had an offline discussion with @yhuai. So this case is a little bit different from #13444. In #13444, the number of leaf files is unknown before issuing the job, and each task may take one or more directories and further list them recursively, thus increasing parallelism is potentially useful. Plus, listing leaf files may suffer from data skew (one directory containing significantly more files than others). In the Parquet schema reading case, the file number is already known, and there's no data skew problem.
[GitHub] spark issue #13613: [SPARK-15889][SQL][STREAMING] Add a unique id to Continu...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13613 **[Test build #60413 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60413/consoleFull)** for PR 13613 at commit [`4971da3`](https://github.com/apache/spark/commit/4971da3598685ab5c9c0274dda95412bc01bedfe). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #13338: [SPARK-13723] [YARN] Change behavior of --num-exe...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/13338#discussion_r66862738 --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---

```diff
@@ -2309,21 +2310,24 @@ private[spark] object Utils extends Logging {
   }

   /**
-   * Return whether dynamic allocation is enabled in the given conf
-   * Dynamic allocation and explicitly setting the number of executors are inherently
-   * incompatible. In environments where dynamic allocation is turned on by default,
-   * the latter should override the former (SPARK-9092).
+   * Return whether dynamic allocation is enabled in the given conf.
    */
   def isDynamicAllocationEnabled(conf: SparkConf): Boolean = {
-    val numExecutor = conf.getInt("spark.executor.instances", 0)
     val dynamicAllocationEnabled = conf.getBoolean("spark.dynamicAllocation.enabled", false)
-    if (numExecutor != 0 && dynamicAllocationEnabled) {
-      logWarning("Dynamic Allocation and num executors both set, thus dynamic allocation disabled.")
-    }
-    numExecutor == 0 && dynamicAllocationEnabled &&
+    dynamicAllocationEnabled &&
       (!isLocalMaster(conf) || conf.getBoolean("spark.dynamicAllocation.testing", false))
   }

+  /**
+   * Return the initial number of executors for dynamic allocation.
+   */
+  def getDynamicAllocationInitialExecutors(conf: SparkConf): Int = {
+    Seq(
+      conf.get(DYN_ALLOCATION_MIN_EXECUTORS),
+      conf.get(DYN_ALLOCATION_INITIAL_EXECUTORS),
+      conf.get(EXECUTOR_INSTANCES).getOrElse(0)).max
```

--- End diff -- Do we need to support the environment variable `SPARK_EXECUTOR_INSTANCES`? It is not officially deprecated.
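jerryshao's question can be illustrated with a sketch. The configuration keys named in the diff are real Spark settings, but the helper below is a simplified, hypothetical version of the patch, with the environment variable folded in as an extra candidate:

```scala
// Simplified sketch of the initial-executor calculation under discussion.
// Taking the max means an explicitly configured spark.executor.instances
// (or, if it were supported, the SPARK_EXECUTOR_INSTANCES environment
// variable) can raise the starting point above the dynamic-allocation
// minimum and configured initial value.
def computeInitialExecutors(
    minExecutors: Int,
    initialExecutors: Int,
    executorInstances: Option[Int],
    envInstances: Option[Int]): Int =
  Seq(
    minExecutors,
    initialExecutors,
    executorInstances.getOrElse(0),
    envInstances.getOrElse(0) // would cover SPARK_EXECUTOR_INSTANCES
  ).max
```

Whether the env var belongs in this `Seq` is exactly the open question in the review: the diff as written only consults the `EXECUTOR_INSTANCES` config entry.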
[GitHub] spark issue #13338: [SPARK-13723] [YARN] Change behavior of --num-executors ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13338 **[Test build #60427 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60427/consoleFull)** for PR 13338 at commit [`bf22b5a`](https://github.com/apache/spark/commit/bf22b5ab0bc8369949ac33833b078e7e13c7ce35).
[GitHub] spark pull request #13137: [SPARK-15247][SQL] Set the default number of part...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/13137#discussion_r66862256 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---

```diff
@@ -795,11 +795,15 @@ private[sql] object ParquetFileFormat extends Logging {
     // side, and resemble fake `FileStatus`es there.
     val partialFileStatusInfo = filesToTouch.map(f => (f.getPath.toString, f.getLen))

+    // Set the number of partitions to prevent following schema reads from generating many tasks
+    // in case of a small number of parquet files.
+    val numParallelism = Math.min(partialFileStatusInfo.size + 1, 1)
```

--- End diff -- `Math.min(partialFileStatusInfo.size + 1, parallelism)` is better. I think this case is different from https://github.com/apache/spark/pull/13444. Here, we already have a set of files and we apply the same operation to every file. However, for the issue that https://github.com/apache/spark/pull/13444 is trying to address, we do not really know the amount of work assigned to a task in advance (it depends on the number of actual files in a dir).
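The problem yhuai flags is easy to see by evaluating the expression: with a constant upper bound of 1, `Math.min(n + 1, 1)` is 1 for every non-negative `n`, so the computed parallelism is always a single task. A minimal sketch of the intended fix, where `parallelism` stands in for whatever default bound the session provides:

```scala
// Buggy form from the diff: always returns 1 for numFiles >= 0,
// so every schema-merging job would run with a single task.
def buggyParallelism(numFiles: Int): Int =
  math.min(numFiles + 1, 1)

// Intended form per the review: cap by the actual parallelism bound
// instead of the constant 1.
def fixedParallelism(numFiles: Int, parallelism: Int): Int =
  math.min(numFiles + 1, parallelism)
```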