[GitHub] spark pull request #17973: [SPARK-20731][SQL] Add ability to change or omit ...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/17973#discussion_r116372498

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
@@ -622,6 +622,31 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
   }
 }

+  test("save tsv with tsv suffix") {
+    withTempDir { dir =>
+      val csvDir = new File(dir, "csv").getCanonicalPath
+      val cars = spark.read
+        .format("csv")
+        .option("header", "true")
+        .load(testFile(carsFile))
+
+      cars.coalesce(1).write
+        .option("header", "true")
+        .option("fileExtension", ".tsv")
+        .option("delimiter", "\t")
--- End diff --

Also, what is your usage scenario? It sounds like you want to omit the extension?

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
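For readers following along, the change under review can be sketched as below. This is illustrative only: `delimiter` is an existing CSV data source option, while `fileExtension` is the new option this PR proposes and is not part of released Spark; the input path and `SparkSession` setup are assumptions.

```scala
import org.apache.spark.sql.SparkSession

object TsvWriteSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("tsv-write-sketch")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical input file; any CSV with a header row works.
    val df = spark.read
      .format("csv")
      .option("header", "true")
      .load("cars.csv")

    // Write tab-delimited output. "fileExtension" is the option
    // proposed by SPARK-20731 and does not exist in released Spark;
    // without it the files are still written with a .csv suffix.
    df.coalesce(1).write
      .format("csv")
      .option("header", "true")
      .option("delimiter", "\t")
      .option("fileExtension", ".tsv") // proposed option, hypothetical
      .save("out/cars-tsv")

    spark.stop()
  }
}
```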
[GitHub] spark issue #17953: [SPARK-20680][SQL] Spark-sql do not support for void col...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/17953

Is your test scenario something like this?

```scala
withTable("t", "tabNullType") {
  val client = spark.sharedState.externalCatalog.asInstanceOf[HiveExternalCatalog].client
  client.runSqlHive("CREATE TABLE t (t1 int)")
  client.runSqlHive("INSERT INTO t VALUES (3)")
  client.runSqlHive("CREATE TABLE tabNullType AS SELECT NULL AS col FROM t")
  spark.table("tabNullType").show()
  spark.table("tabNullType").printSchema()
}
```

Is this what you want?
[GitHub] spark issue #17975: [SPARK-20725][SQL][BRANCH-2.1] partial aggregate should ...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/17975

It seems there is a bug if we backport this without #17541. cc @hvanhovell, shall we also backport #17541? Or leave branch-2.1 alone, as this is not a critical bug?
[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/17938

LGTM except a few minor comments. cc @tejasapatil @cloud-fan
[GitHub] spark pull request #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/17938#discussion_r116371744 --- Diff: docs/sql-programming-guide.md --- @@ -581,6 +581,113 @@ Starting from Spark 2.1, persistent datasource tables have per-partition metadat Note that partition information is not gathered by default when creating external datasource tables (those with a `path` option). To sync the partition information in the metastore, you can invoke `MSCK REPAIR TABLE`. +### Bucketing, Sorting and Partitioning + +For file-based data source it is also possible to bucket and sort or partition the output. +Bucketing and sorting is applicable only to persistent tables: + + + + +{% include_example write_sorting_and_bucketing scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %} + + + +{% include_example write_sorting_and_bucketing java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %} + + + +{% include_example write_sorting_and_bucketing python/sql/datasource.py %} + + + + +{% highlight sql %} + +CREATE TABLE users_bucketed_by_name( + name STRING, + favorite_color STRING, + favorite_NUMBERS array +) USING parquet +CLUSTERED BY(name) INTO 42 BUCKETS; + +{% endhighlight %} + + + + + +while partitioning can be used with both `save` and `saveAsTable`: + + + + + +{% include_example write_partitioning scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %} + + + +{% include_example write_partitioning java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %} + + + +{% include_example write_partitioning python/sql/datasource.py %} + + + + +{% highlight sql %} + +CREATE TABLE users_by_favorite_color( + name STRING, + favorite_color STRING, + favorite_NUMBERS array +) USING csv PARTITIONED BY(favorite_color); + +{% endhighlight %} + + + + + +It is possible to use both partitions and buckets for a single table: + + + + +{% include_example write_partition_and_bucket 
scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %} + + + +{% include_example write_partition_and_bucket java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %} + + + +{% include_example write_partition_and_bucket python/sql/datasource.py %} + + + + +{% highlight sql %} + +CREATE TABLE users_bucketed_and_partitioned( + name STRING, + favorite_color STRING, + favorite_NUMBERS array +) USING parquet +PARTITIONED BY (favorite_color) +CLUSTERED BY(name) INTO 42 BUCKETS; + +{% endhighlight %} + + + + + +`partitionBy` creates a directory structure as described in the [Partition Discovery](#partition-discovery) section. +Because of that it has limited applicability to columns with high cardinality. In contrast `bucketBy` distributes +data across fixed number of buckets and can be used if a number of unique values is unbounded. --- End diff -- `used if` -> `used when `
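The `DataFrameWriter` calls the quoted doc section describes can be illustrated with a short Scala sketch. This assumes `peopleDF` and `usersDF` are existing DataFrames with the named columns; `bucketBy`, `sortBy`, and `partitionBy` are the actual Spark 2.x APIs.

```scala
// Bucketing and sorting apply only to persistent tables (saveAsTable):
peopleDF.write
  .bucketBy(42, "name")
  .sortBy("name")
  .saveAsTable("people_bucketed")

// Partitioning works with both save and saveAsTable:
usersDF.write
  .partitionBy("favorite_color")
  .format("parquet")
  .save("namesPartByColor.parquet")

// Partitioning and bucketing can be combined for a single persistent table:
usersDF.write
  .partitionBy("favorite_color")
  .bucketBy(42, "name")
  .saveAsTable("users_partitioned_bucketed")
```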
[GitHub] spark pull request #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/17938#discussion_r116371733 --- Diff: docs/sql-programming-guide.md --- @@ -581,6 +581,113 @@ Starting from Spark 2.1, persistent datasource tables have per-partition metadat Note that partition information is not gathered by default when creating external datasource tables (those with a `path` option). To sync the partition information in the metastore, you can invoke `MSCK REPAIR TABLE`. +### Bucketing, Sorting and Partitioning + +For file-based data source it is also possible to bucket and sort or partition the output. +Bucketing and sorting is applicable only to persistent tables: + + + + +{% include_example write_sorting_and_bucketing scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %} + + + +{% include_example write_sorting_and_bucketing java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %} + + + +{% include_example write_sorting_and_bucketing python/sql/datasource.py %} + + + + +{% highlight sql %} + +CREATE TABLE users_bucketed_by_name( + name STRING, + favorite_color STRING, + favorite_NUMBERS array +) USING parquet +CLUSTERED BY(name) INTO 42 BUCKETS; + +{% endhighlight %} + + + + + +while partitioning can be used with both `save` and `saveAsTable`: + + + + + +{% include_example write_partitioning scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %} + + + +{% include_example write_partitioning java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %} + + + +{% include_example write_partitioning python/sql/datasource.py %} + + + + +{% highlight sql %} + +CREATE TABLE users_by_favorite_color( + name STRING, + favorite_color STRING, + favorite_NUMBERS array +) USING csv PARTITIONED BY(favorite_color); + +{% endhighlight %} + + + + + +It is possible to use both partitions and buckets for a single table: + + + + +{% include_example write_partition_and_bucket 
scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %} + + + +{% include_example write_partition_and_bucket java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %} + + + +{% include_example write_partition_and_bucket python/sql/datasource.py %} + + + + +{% highlight sql %} + +CREATE TABLE users_bucketed_and_partitioned( + name STRING, + favorite_color STRING, + favorite_NUMBERS array +) USING parquet +PARTITIONED BY (favorite_color) +CLUSTERED BY(name) INTO 42 BUCKETS; + +{% endhighlight %} + + + + + +`partitionBy` creates a directory structure as described in the [Partition Discovery](#partition-discovery) section. +Because of that it has limited applicability to columns with high cardinality. In contrast `bucketBy` distributes --- End diff -- `In contrast `bucketBy` distributes` -> `In contrast, `bucketBy` distributes`
[GitHub] spark pull request #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/17938#discussion_r116371727 --- Diff: docs/sql-programming-guide.md --- @@ -581,6 +581,113 @@ Starting from Spark 2.1, persistent datasource tables have per-partition metadat Note that partition information is not gathered by default when creating external datasource tables (those with a `path` option). To sync the partition information in the metastore, you can invoke `MSCK REPAIR TABLE`. +### Bucketing, Sorting and Partitioning + +For file-based data source it is also possible to bucket and sort or partition the output. +Bucketing and sorting is applicable only to persistent tables: + + + + +{% include_example write_sorting_and_bucketing scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %} + + + +{% include_example write_sorting_and_bucketing java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %} + + + +{% include_example write_sorting_and_bucketing python/sql/datasource.py %} + + + + +{% highlight sql %} + +CREATE TABLE users_bucketed_by_name( + name STRING, + favorite_color STRING, + favorite_NUMBERS array +) USING parquet +CLUSTERED BY(name) INTO 42 BUCKETS; + +{% endhighlight %} + + + + + +while partitioning can be used with both `save` and `saveAsTable`: + + + + + +{% include_example write_partitioning scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %} + + + +{% include_example write_partitioning java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %} + + + +{% include_example write_partitioning python/sql/datasource.py %} + + + + +{% highlight sql %} + +CREATE TABLE users_by_favorite_color( + name STRING, + favorite_color STRING, + favorite_NUMBERS array +) USING csv PARTITIONED BY(favorite_color); + +{% endhighlight %} + + + + + +It is possible to use both partitions and buckets for a single table: + + + + +{% include_example write_partition_and_bucket 
scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %} + + + +{% include_example write_partition_and_bucket java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %} + + + +{% include_example write_partition_and_bucket python/sql/datasource.py %} + + + + +{% highlight sql %} + +CREATE TABLE users_bucketed_and_partitioned( + name STRING, + favorite_color STRING, + favorite_NUMBERS array +) USING parquet +PARTITIONED BY (favorite_color) +CLUSTERED BY(name) INTO 42 BUCKETS; + +{% endhighlight %} + + + + + +`partitionBy` creates a directory structure as described in the [Partition Discovery](#partition-discovery) section. +Because of that it has limited applicability to columns with high cardinality. In contrast `bucketBy` distributes --- End diff -- `Because of that it has` -> `Thus, it has`
[GitHub] spark pull request #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/17938#discussion_r116371680 --- Diff: docs/sql-programming-guide.md --- @@ -581,6 +581,113 @@ Starting from Spark 2.1, persistent datasource tables have per-partition metadat Note that partition information is not gathered by default when creating external datasource tables (those with a `path` option). To sync the partition information in the metastore, you can invoke `MSCK REPAIR TABLE`. +### Bucketing, Sorting and Partitioning + +For file-based data source it is also possible to bucket and sort or partition the output. +Bucketing and sorting is applicable only to persistent tables: + + + + +{% include_example write_sorting_and_bucketing scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %} + + + +{% include_example write_sorting_and_bucketing java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %} + + + +{% include_example write_sorting_and_bucketing python/sql/datasource.py %} + + + + +{% highlight sql %} + +CREATE TABLE users_bucketed_by_name( + name STRING, + favorite_color STRING, + favorite_NUMBERS array +) USING parquet +CLUSTERED BY(name) INTO 42 BUCKETS; + +{% endhighlight %} + + + + + +while partitioning can be used with both `save` and `saveAsTable`: --- End diff -- Nit: ``` both `save` and `saveAsTable` ``` -> ``` both `save` and `saveAsTable` when using the Dataset APIs. ```
[GitHub] spark pull request #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/17938#discussion_r116371649 --- Diff: docs/sql-programming-guide.md --- @@ -581,6 +581,113 @@ Starting from Spark 2.1, persistent datasource tables have per-partition metadat Note that partition information is not gathered by default when creating external datasource tables (those with a `path` option). To sync the partition information in the metastore, you can invoke `MSCK REPAIR TABLE`. +### Bucketing, Sorting and Partitioning + +For file-based data source it is also possible to bucket and sort or partition the output. +Bucketing and sorting is applicable only to persistent tables: + + + + +{% include_example write_sorting_and_bucketing scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %} + + + +{% include_example write_sorting_and_bucketing java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %} + + + +{% include_example write_sorting_and_bucketing python/sql/datasource.py %} + + + + +{% highlight sql %} + +CREATE TABLE users_bucketed_by_name( + name STRING, + favorite_color STRING, + favorite_NUMBERS array +) USING parquet +CLUSTERED BY(name) INTO 42 BUCKETS; + +{% endhighlight %} + + + + + +while partitioning can be used with both `save` and `saveAsTable`: + + + + + +{% include_example write_partitioning scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %} + + + +{% include_example write_partitioning java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %} + + + +{% include_example write_partitioning python/sql/datasource.py %} + + + + +{% highlight sql %} + +CREATE TABLE users_by_favorite_color( + name STRING, + favorite_color STRING, + favorite_NUMBERS array +) USING csv PARTITIONED BY(favorite_color); + +{% endhighlight %} + + + + + +It is possible to use both partitions and buckets for a single table: --- End diff -- `partitions and buckets` -> `partitioning and bucketing`
[GitHub] spark issue #17298: [SPARK-19094][WIP][PySpark] Plumb through logging for IJ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17298

Merged build finished. Test FAILed.
[GitHub] spark issue #17298: [SPARK-19094][WIP][PySpark] Plumb through logging for IJ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17298

Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76903/ Test FAILed.
[GitHub] spark issue #17298: [SPARK-19094][WIP][PySpark] Plumb through logging for IJ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17298

**[Test build #76903 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76903/testReport)** for PR 17298 at commit [`89cf739`](https://github.com/apache/spark/commit/89cf7394527b654c0a079244fe88378278f70e7a).

* This patch **fails PySpark unit tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `case class ParseToTimestamp(left: Expression, format: Option[Expression], child: Expression)`
  * `class AstBuilder(conf: SQLConf) extends SqlBaseBaseVisitor[AnyRef] with Logging `
  * `class CatalystSqlParser(conf: SQLConf) extends AbstractSqlParser `
  * `case class ColumnStatsMap(originalMap: AttributeMap[ColumnStat]) `
  * `trait DataSourceScanExec extends LeafExecNode with CodegenSupport with PredicateHelper `
  * `class SparkSqlAstBuilder(conf: SQLConf) extends AstBuilder(conf) `
  * ` s\"($`
[GitHub] spark pull request #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/17938#discussion_r116371627 --- Diff: docs/sql-programming-guide.md --- @@ -581,6 +581,113 @@ Starting from Spark 2.1, persistent datasource tables have per-partition metadat Note that partition information is not gathered by default when creating external datasource tables (those with a `path` option). To sync the partition information in the metastore, you can invoke `MSCK REPAIR TABLE`. +### Bucketing, Sorting and Partitioning + +For file-based data source it is also possible to bucket and sort or partition the output. --- End diff -- Nit, `For file-based data source it` -> `For file-based data source, it`
[GitHub] spark pull request #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/17938#discussion_r116371632 --- Diff: docs/sql-programming-guide.md --- @@ -581,6 +581,113 @@ Starting from Spark 2.1, persistent datasource tables have per-partition metadat Note that partition information is not gathered by default when creating external datasource tables (those with a `path` option). To sync the partition information in the metastore, you can invoke `MSCK REPAIR TABLE`. +### Bucketing, Sorting and Partitioning + +For file-based data source it is also possible to bucket and sort or partition the output. +Bucketing and sorting is applicable only to persistent tables: --- End diff -- `is applicable` -> `are applicable`
[GitHub] spark pull request #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/17938#discussion_r116371615 --- Diff: docs/sql-programming-guide.md --- @@ -581,6 +581,113 @@ Starting from Spark 2.1, persistent datasource tables have per-partition metadat Note that partition information is not gathered by default when creating external datasource tables (those with a `path` option). To sync the partition information in the metastore, you can invoke `MSCK REPAIR TABLE`. +### Bucketing, Sorting and Partitioning + +For file-based data source it is also possible to bucket and sort or partition the output. +Bucketing and sorting is applicable only to persistent tables: + + + + +{% include_example write_sorting_and_bucketing scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %} + + + +{% include_example write_sorting_and_bucketing java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %} + + + +{% include_example write_sorting_and_bucketing python/sql/datasource.py %} + + + + +{% highlight sql %} + +CREATE TABLE users_bucketed_by_name( + name STRING, + favorite_color STRING, + favorite_NUMBERS array +) USING parquet +CLUSTERED BY(name) INTO 42 BUCKETS; --- End diff -- Could you please use the same table names `people_bucketed` with the same column names in the example? Thanks!
[GitHub] spark pull request #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/17938#discussion_r116371598 --- Diff: docs/sql-programming-guide.md --- @@ -581,6 +581,113 @@ Starting from Spark 2.1, persistent datasource tables have per-partition metadat Note that partition information is not gathered by default when creating external datasource tables (those with a `path` option). To sync the partition information in the metastore, you can invoke `MSCK REPAIR TABLE`. +### Bucketing, Sorting and Partitioning + +For file-based data source it is also possible to bucket and sort or partition the output. +Bucketing and sorting is applicable only to persistent tables: + + + + +{% include_example write_sorting_and_bucketing scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %} + + + +{% include_example write_sorting_and_bucketing java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %} + + + +{% include_example write_sorting_and_bucketing python/sql/datasource.py %} + + + + +{% highlight sql %} + +CREATE TABLE users_bucketed_by_name( + name STRING, + favorite_color STRING, + favorite_NUMBERS array +) USING parquet +CLUSTERED BY(name) INTO 42 BUCKETS; --- End diff -- To be consistent with the example in the other APIs, it is missing the `SORTED BY` clause.
[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/17938

In the current 2.2 docs, we already updated all the syntax to `CREATE TABLE ... USING ...`. This is a new change delivered in 2.2, so it is OK to document it the way you just committed. Let me review it carefully now. Thanks for your work!
[GitHub] spark issue #17975: [SPARK-20725][SQL][BRANCH-2.1] partial aggregate should ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/17975

We did not backport https://github.com/apache/spark/pull/17541 to 2.1. Is it still OK to backport this?
[GitHub] spark issue #15435: [SPARK-17139][ML] Add model summary for MultinomialLogis...
Github user sethah commented on the issue: https://github.com/apache/spark/pull/15435

ping! @jkbradley @yanboliang
[GitHub] spark issue #17975: [SPARK-20725][SQL][BRANCH-2.1] partial aggregate should ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17975

Merged build finished. Test FAILed.
[GitHub] spark issue #17975: [SPARK-20725][SQL][BRANCH-2.1] partial aggregate should ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17975

Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76904/ Test FAILed.
[GitHub] spark issue #17975: [SPARK-20725][SQL][BRANCH-2.1] partial aggregate should ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17975

**[Test build #76904 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76904/testReport)** for PR 17975 at commit [`c4d1679`](https://github.com/apache/spark/commit/c4d16796f3ecec259e3e1af4afa9c13b4c5a142b).

* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #17975: [SPARK-20725][SQL][BRANCH-2.1] partial aggregate should ...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/17975 cc @hvanhovell
[GitHub] spark issue #17975: [SPARK-20725][SQL][BRANCH-2.1] partial aggregate should ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17975 **[Test build #76904 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76904/testReport)** for PR 17975 at commit [`c4d1679`](https://github.com/apache/spark/commit/c4d16796f3ecec259e3e1af4afa9c13b4c5a142b).
[GitHub] spark pull request #17975: [SPARK-20725][SQL][BRANCH-2.1] partial aggregate ...
GitHub user cloud-fan opened a pull request: https://github.com/apache/spark/pull/17975

[SPARK-20725][SQL][BRANCH-2.1] partial aggregate should behave correctly for sameResult

## What changes were proposed in this pull request?

this backports https://github.com/apache/spark/pull/17964 to 2.1

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloud-fan/spark tmp

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17975.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #17975

commit c4d16796f3ecec259e3e1af4afa9c13b4c5a142b
Author: Wenchen Fan
Date: 2017-05-14T01:24:05Z

    partial aggregate should behave correctly for sameResult
[GitHub] spark issue #17298: [SPARK-19094][WIP][PySpark] Plumb through logging for IJ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17298 **[Test build #76903 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76903/testReport)** for PR 17298 at commit [`89cf739`](https://github.com/apache/spark/commit/89cf7394527b654c0a079244fe88378278f70e7a).
[GitHub] spark pull request #17973: [SPARK-20731][SQL] Add ability to change or omit ...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/17973#discussion_r116370085

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
@@ -622,6 +622,31 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
     }
   }

+  test("save tsv with tsv suffix") {
+    withTempDir { dir =>
+      val csvDir = new File(dir, "csv").getCanonicalPath
+      val cars = spark.read
+        .format("csv")
+        .option("header", "true")
+        .load(testFile(carsFile))
+
+      cars.coalesce(1).write
+        .option("header", "true")
+        .option("fileExtension", ".tsv")
+        .option("delimiter", "\t")
--- End diff --

What is the reason why Hive introduced the conf `hive.output.file.extension`?
[GitHub] spark pull request #17974: Eagle beta
Github user jashwantraj92 closed the pull request at: https://github.com/apache/spark/pull/17974
[GitHub] spark pull request #17974: Eagle beta
GitHub user jashwantraj92 opened a pull request: https://github.com/apache/spark/pull/17974

Eagle beta

## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/epfl-labos/spark eagle-beta

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17974.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #17974

commit 748b05302937cbc26aad1b3db61c42c4d0bbe063
Author: Pamela
Date: 2016-04-18T15:46:41Z

    Hawk/Eagle-beta plugin

commit c237870c94e495e19da0cf0d47f23c0197a754c2
Author: Pamela
Date: 2016-04-19T15:53:56Z

    Deleted Sparrow dependency
[GitHub] spark pull request #17965: [SPARK-20726][SPARKR] wrapper for SQL broadcast
Github user zero323 commented on a diff in the pull request: https://github.com/apache/spark/pull/17965#discussion_r116366145

--- Diff: R/pkg/R/generics.R ---
@@ -799,6 +799,10 @@ setGeneric("write.df", function(df, path = NULL, ...) { standardGeneric("write.d
 #' @export
 setGeneric("randomSplit", function(x, weights, seed) { standardGeneric("randomSplit") })

+#' @rdname broadcast
+#' @export
+setGeneric("broadcast", function(x) { standardGeneric("broadcast") })
--- End diff --

> this list is sorted alphabetically within this section

Looks like it used to be at some point, but these days are long gone. I can reorder it right now, but this means rearranging a whole section.
[GitHub] spark pull request #17964: [SPARK-20725][SQL] partial aggregate should behav...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/17964
[GitHub] spark issue #17964: [SPARK-20725][SQL] partial aggregate should behave corre...
Github user hvanhovell commented on the issue: https://github.com/apache/spark/pull/17964 @cloud-fan can you backport this to 2.1?
[GitHub] spark issue #17964: [SPARK-20725][SQL] partial aggregate should behave corre...
Github user hvanhovell commented on the issue: https://github.com/apache/spark/pull/17964 LGTM - merging to master/2.2/2.1
[GitHub] spark issue #17084: [SPARK-18693][ML][MLLIB] ML Evaluators should use weight...
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17084 @imatiach-msft Thanks for the PR. Added a couple of comments. Sorry for the delayed review.
[GitHub] spark pull request #17084: [SPARK-18693][ML][MLLIB] ML Evaluators should use...
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/17084#discussion_r116364047

--- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/BinaryClassificationEvaluator.scala ---
@@ -77,12 +87,16 @@ class BinaryClassificationEvaluator @Since("1.4.0") (@Since("1.4.0") override va
     SchemaUtils.checkNumericType(schema, $(labelCol))
     // TODO: When dataset metadata has been implemented, check rawPredictionCol vector length = 2.
-    val scoreAndLabels =
-      dataset.select(col($(rawPredictionCol)), col($(labelCol)).cast(DoubleType)).rdd.map {
-        case Row(rawPrediction: Vector, label: Double) => (rawPrediction(1), label)
-        case Row(rawPrediction: Double, label: Double) => (rawPrediction, label)
+    val scoreAndLabelsWithWeights =
+      dataset.select(col($(rawPredictionCol)), col($(labelCol)).cast(DoubleType),
+        if (!isDefined(weightCol) || $(weightCol).isEmpty) lit(1.0) else col($(weightCol)))
--- End diff --

Check weightCol is double?
[GitHub] spark pull request #17084: [SPARK-18693][ML][MLLIB] ML Evaluators should use...
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/17084#discussion_r116364179

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/binary/BinaryConfusionMatrix.scala ---
@@ -22,22 +22,22 @@ package org.apache.spark.mllib.evaluation.binary
  */
 private[evaluation] trait BinaryConfusionMatrix {
   /** number of true positives */
-  def numTruePositives: Long
+  def numTruePositives: Double
--- End diff --

I feel it may be better to create new attributes like `weightedTruePositives`
[GitHub] spark pull request #17084: [SPARK-18693][ML][MLLIB] ML Evaluators should use...
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/17084#discussion_r116364061

--- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/BinaryClassificationEvaluator.scala ---
@@ -36,12 +36,18 @@ import org.apache.spark.sql.types.DoubleType
 @Since("1.2.0")
 @Experimental
 class BinaryClassificationEvaluator @Since("1.4.0") (@Since("1.4.0") override val uid: String)
-  extends Evaluator with HasRawPredictionCol with HasLabelCol with DefaultParamsWritable {
+  extends Evaluator with HasRawPredictionCol with HasLabelCol
+    with HasWeightCol with DefaultParamsWritable {

   @Since("1.2.0")
   def this() = this(Identifiable.randomUID("binEval"))

   /**
+   * Default number of bins to use for binary classification evaluation.
+   */
+  val defaultNumberOfBins = 1000
--- End diff --

Why 1000?
[GitHub] spark pull request #17084: [SPARK-18693][ML][MLLIB] ML Evaluators should use...
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/17084#discussion_r116364140

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala ---
@@ -41,13 +41,27 @@ import org.apache.spark.sql.DataFrame
  *          partition boundaries.
  */
 @Since("1.0.0")
-class BinaryClassificationMetrics @Since("1.3.0") (
-    @Since("1.3.0") val scoreAndLabels: RDD[(Double, Double)],
-    @Since("1.3.0") val numBins: Int) extends Logging {
+class BinaryClassificationMetrics @Since("2.2.0") (
+    val numBins: Int,
+    @Since("2.2.0") val scoreAndLabelsWithWeights: RDD[(Double, (Double, Double))])
+  extends Logging {

   require(numBins >= 0, "numBins must be nonnegative")

   /**
+   * Retrieves the score and labels (for binary compatibility).
+   * @return The score and labels.
+   */
+  @Since("1.0.0")
--- End diff --

inconsistent annotation. was 1.3.0
[GitHub] spark pull request #17084: [SPARK-18693][ML][MLLIB] ML Evaluators should use...
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/17084#discussion_r116364224

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala ---
@@ -146,11 +160,13 @@ class BinaryClassificationMetrics @Since("1.3.0") (
   private lazy val (
     cumulativeCounts: RDD[(Double, BinaryLabelCounter)],
     confusions: RDD[(Double, BinaryConfusionMatrix)]) = {
-    // Create a bin for each distinct score value, count positives and negatives within each bin,
-    // and then sort by score values in descending order.
-    val counts = scoreAndLabels.combineByKey(
-      createCombiner = (label: Double) => new BinaryLabelCounter(0L, 0L) += label,
-      mergeValue = (c: BinaryLabelCounter, label: Double) => c += label,
+    // Create a bin for each distinct score value, count weighted positives and
+    // negatives within each bin, and then sort by score values in descending order.
+    val counts = scoreAndLabelsWithWeights.combineByKey(
+      createCombiner = (labelAndWeight: (Double, Double)) =>
+        new BinaryLabelCounter(0L, 0L) += (labelAndWeight._1, labelAndWeight._2),
--- End diff --

`new BinaryLabelCounter(0.0, 0.0)`? Defined to take double parameters below.
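For readers following the review above: the point is that once per-bin counts accumulate example weights rather than long counts, the counter's fields must be `Double`. A minimal standalone sketch of that idea (a hypothetical `WeightedLabelCounter`, not Spark's actual `BinaryLabelCounter`):

```scala
// Hypothetical standalone sketch, not Spark's BinaryLabelCounter:
// a per-bin counter that accumulates example weights rather than
// long counts, which is why Double constructor arguments make sense.
class WeightedLabelCounter(
    var weightedPositives: Double = 0.0,
    var weightedNegatives: Double = 0.0) {

  // Add one (label, weight) pair; labels are assumed to be 0.0 or 1.0.
  def add(label: Double, weight: Double): this.type = {
    if (label > 0.5) weightedPositives += weight
    else weightedNegatives += weight
    this
  }
}
```

With weights of 1.0 for every example this degenerates to the unweighted counts, which is why the weighted API can stay backward compatible.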
[GitHub] spark issue #17970: [SPARK-20730][SQL] Add an optimizer rule to combine nest...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17970 Merged build finished. Test PASSed.
[GitHub] spark issue #17970: [SPARK-20730][SQL] Add an optimizer rule to combine nest...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17970 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76902/ Test PASSed.
[GitHub] spark issue #17970: [SPARK-20730][SQL] Add an optimizer rule to combine nest...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17970

**[Test build #76902 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76902/testReport)** for PR 17970 at commit [`4a98693`](https://github.com/apache/spark/commit/4a9869327311a073b7c6e2197605f8422c2154ba).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #17970: [SPARK-20730][SQL] Add an optimizer rule to combine nest...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17970 **[Test build #76902 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76902/testReport)** for PR 17970 at commit [`4a98693`](https://github.com/apache/spark/commit/4a9869327311a073b7c6e2197605f8422c2154ba).
[GitHub] spark pull request #17970: [SPARK-20730][SQL] Add an optimizer rule to combi...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/17970#discussion_r116360222

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -111,7 +111,8 @@ abstract class Optimizer(sessionCatalog: SessionCatalog, conf: SQLConf)
       RemoveRedundantProject,
       SimplifyCreateStructOps,
       SimplifyCreateArrayOps,
-      SimplifyCreateMapOps) ++
+      SimplifyCreateMapOps,
+      CombineConcat) ++
--- End diff --

Thanks for the comments! Fixed.
[GitHub] spark issue #17970: [SPARK-20730][SQL] Add an optimizer rule to combine nest...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/17970 +1, LGTM except one minor naming comment.
[GitHub] spark issue #17964: [SPARK-20725][SQL] partial aggregate should behave corre...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17964 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76901/ Test PASSed.
[GitHub] spark issue #17964: [SPARK-20725][SQL] partial aggregate should behave corre...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17964 Merged build finished. Test PASSed.
[GitHub] spark pull request #17644: [SPARK-17729] [SQL] Enable creating hive bucketed...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/17644#discussion_r116359891

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala ---
@@ -307,6 +307,27 @@ case class InsertIntoHiveTable(
     }
   }

+    table.bucketSpec match {
+      case Some(bucketSpec) =>
+        // Writes to bucketed hive tables are allowed only if user does not care about maintaining
+        // table's bucketing ie. both "hive.enforce.bucketing" and "hive.enforce.sorting" are
+        // set to false
+        val enforceBucketingConfig = "hive.enforce.bucketing"
+        val enforceSortingConfig = "hive.enforce.sorting"
+
+        val message = s"Output Hive table ${table.identifier} is bucketed but Spark" +
+          "currently does NOT populate bucketed output which is compatible with Hive."
+
+        if (hadoopConf.get(enforceBucketingConfig, "true").toBoolean ||
+            hadoopConf.get(enforceSortingConfig, "true").toBoolean) {
+          throw new AnalysisException(message)
+        } else {
+          logWarning(message + s" Inserting data anyways since both $enforceBucketingConfig and " +
+            s"$enforceSortingConfig are set to false.")
--- End diff --

shall we remove the bucket properties of the table in this case? what does hive do?
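The enforcement check quoted in that diff reduces to a small predicate. A sketch with a plain `Map` standing in for `hadoopConf` (a hypothetical helper written for illustration, not Spark code):

```scala
// Sketch of the enforcement logic under review: writes to a bucketed
// Hive table proceed only when BOTH flags are explicitly set to false;
// either flag defaulting to "true" blocks the write.
def bucketingEnforced(conf: Map[String, String]): Boolean =
  conf.getOrElse("hive.enforce.bucketing", "true").toBoolean ||
  conf.getOrElse("hive.enforce.sorting", "true").toBoolean
```

The defaults matter here: with an empty configuration the predicate is true, so the safe behavior (rejecting the write) is the default.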
[GitHub] spark pull request #17970: [SPARK-20730][SQL] Add an optimizer rule to combi...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/17970#discussion_r116359887

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -111,7 +111,8 @@ abstract class Optimizer(sessionCatalog: SessionCatalog, conf: SQLConf)
       RemoveRedundantProject,
       SimplifyCreateStructOps,
       SimplifyCreateArrayOps,
-      SimplifyCreateMapOps) ++
+      SimplifyCreateMapOps,
+      CombineConcat) ++
--- End diff --

Hi, @maropu. `CombineConcats` like the other `Combine~` optimizer?
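For context, the rule being registered above flattens nested `Concat` expressions. Its effect can be modeled on a toy expression tree (this is an illustrative sketch over a made-up `Expr` ADT, not Spark's `Expression` API):

```scala
// Toy model of a "combine concats" optimizer rule: recursively
// flatten nested Concat nodes into a single Concat over the leaves.
sealed trait Expr
case class Lit(s: String) extends Expr
case class Concat(children: Seq[Expr]) extends Expr

def combineConcats(e: Expr): Expr = e match {
  case Concat(cs) =>
    // Optimize children first, then splice any child Concat's
    // children directly into this node's child list.
    Concat(cs.map(combineConcats).flatMap {
      case Concat(inner) => inner
      case other         => Seq(other)
    })
  case other => other
}
```

For example, `Concat(a, Concat(b, c))` becomes `Concat(a, b, c)`, which lets codegen emit a single string-building loop instead of nested ones.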
[GitHub] spark issue #17964: [SPARK-20725][SQL] partial aggregate should behave corre...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17964

**[Test build #76901 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76901/testReport)** for PR 17964 at commit [`49da955`](https://github.com/apache/spark/commit/49da955dce260260325708d07becbc692cd3a005).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #17644: [SPARK-17729] [SQL] Enable creating hive bucketed...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/17644#discussion_r116359799

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala ---
@@ -408,9 +425,7 @@ private[hive] class HiveClientImpl(
       },
       schema = schema,
       partitionColumnNames = partCols.map(_.name),
-      // We can not populate bucketing information for Hive tables as Spark SQL has a different
-      // implementation of hash function from Hive.
-      bucketSpec = None,
+      bucketSpec = bucketSpec,
--- End diff --

please add a comment to say that, for data source tables, we will always overwrite the bucket spec in `HiveExternalCatalog` with the bucketing information in table properties.
[GitHub] spark pull request #17644: [SPARK-17729] [SQL] Enable creating hive bucketed...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/17644#discussion_r116359692

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalog.scala ---
@@ -17,6 +17,7 @@ package org.apache.spark.sql.catalyst.catalog

+import org.apache.spark.sql.AnalysisException
--- End diff --

please remove these unnecessary changes in this PR.
[GitHub] spark pull request #17973: [SPARK-20731][SQL] Add ability to change or omit ...
Github user mikkokupsu commented on a diff in the pull request: https://github.com/apache/spark/pull/17973#discussion_r116359395

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
@@ -622,6 +622,31 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
     }
   }

+  test("save tsv with tsv suffix") {
+    withTempDir { dir =>
+      val csvDir = new File(dir, "csv").getCanonicalPath
+      val cars = spark.read
+        .format("csv")
+        .option("header", "true")
+        .load(testFile(carsFile))
+
+      cars.coalesce(1).write
+        .option("header", "true")
+        .option("fileExtension", ".tsv")
+        .option("delimiter", "\t")
--- End diff --

Hi @dongjoon-hyun, yes, the original goal was to remove the file extension, but I decided to let the user decide.
[GitHub] spark pull request #17965: [SPARK-20726][SPARKR] wrapper for SQL broadcast
GitHub user zero323 reopened a pull request: https://github.com/apache/spark/pull/17965

[SPARK-20726][SPARKR] wrapper for SQL broadcast

## What changes were proposed in this pull request?

- Adds R wrapper for `o.a.s.sql.functions.broadcast`.
- Renames `broadcast` to `broadcast_`.

## How was this patch tested?

Unit tests, `check-cran.sh` checks.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zero323/spark SPARK-20726

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17965.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #17965

commit f190d62460829dcfb84ff1a8e6dd6fe9cbd25719
Author: zero323
Date: 2017-05-12T15:54:46Z

    Initial implementation

commit 397ab1f7b4b4e2b9e51b697c92e3be197fed4554
Author: zero323
Date: 2017-05-12T17:38:31Z

    Fix style

commit 246b91f8af84115af8f6283fb783000c9cc613ec
Author: zero323
Date: 2017-05-13T10:08:08Z

    Style

commit 1530785f7469830446cd95717d524eb42d88e4ab
Author: zero323
Date: 2017-05-13T10:38:50Z

    Rename broadcast_ to broadcastRDD
[GitHub] spark pull request #17965: [SPARK-20726][SPARKR] wrapper for SQL broadcast
Github user zero323 closed the pull request at: https://github.com/apache/spark/pull/17965
[GitHub] spark issue #12646: [SPARK-14878][SQL] Trim characters string function suppo...
Github user kevinyu98 commented on the issue: https://github.com/apache/spark/pull/12646 test please.
[GitHub] spark pull request #17973: [SPARK-20731][SQL] Add ability to change or omit ...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/17973#discussion_r116359183

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
@@ -622,6 +622,31 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
     }
   }
 
+  test("save tsv with tsv suffix") {
+    withTempDir { dir =>
+      val csvDir = new File(dir, "csv").getCanonicalPath
+      val cars = spark.read
+        .format("csv")
+        .option("header", "true")
+        .load(testFile(carsFile))
+
+      cars.coalesce(1).write
+        .option("header", "true")
+        .option("fileExtension", ".tsv")
+        .option("delimiter", "\t")
--- End diff --

Hi, @mikkokupsu. Is the original goal to support the many existing files (without the `.csv` extension)?
[GitHub] spark issue #17251: [SPARK-19910][SQL] `stack` should not reject NULL values...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/17251 Could this fix be part of Spark 2.2.0, @cloud-fan and @gatorsmile?
[GitHub] spark issue #17941: [SPARK-20684][R] Expose createGlobalTempView and dropGlo...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/17941 Thank you for the comments, @falaki and @felixcheung. I added the duplication link to the issue, SPARK-20684, and asked @falaki to close the JIRA issue because he is the reporter.
[GitHub] spark issue #17963: [SPARK-20722][CORE] Replay newer event log that hasn't b...
Github user sharkdtu commented on the issue: https://github.com/apache/spark/pull/17963 cc @srowen @ajbozarth
[GitHub] spark pull request #16912: [SPARK-19576] [Core] Task attempt paths exist in ...
Github user sharkdtu closed the pull request at: https://github.com/apache/spark/pull/16912
[GitHub] spark issue #17956: [SPARK-18772][SQL] Avoid unnecessary conversion try for ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/17956 Thank you everybody sincerely.
[GitHub] spark pull request #17973: [SPARK-20731][SQL] Add ability to change or omit ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17973#discussion_r116358020

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
@@ -622,6 +622,31 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
     }
   }
 
+  test("save tsv with tsv suffix") {
+    withTempDir { dir =>
+      val csvDir = new File(dir, "csv").getCanonicalPath
+      val cars = spark.read
+        .format("csv")
+        .option("header", "true")
+        .load(testFile(carsFile))
+
+      cars.coalesce(1).write
+        .option("header", "true")
+        .option("fileExtension", ".tsv")
+        .option("delimiter", "\t")
--- End diff --

I would suggest leaving this out unless there is a stronger reason for it. The downside is that this allows an arbitrary name and does not guarantee the extension is, say, `.tsv` when the delimiter is a tab; it is purely up to the user. I added those extensions long ago, and one of the motivations was auto-detection of the datasource, as Hadoop does (which we ended up not adding yet due to the cost of listing files, etc.).
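[Editor's note: the concern above can be made concrete with a small sketch. Nothing ties the proposed `fileExtension` option to the `delimiter` option, so any consistency between the two is purely a user convention. The helper below is hypothetical (it is not part of Spark); it only illustrates the mapping a caller would have to enforce themselves.]

```scala
// Hypothetical helper, NOT part of Spark: derives the conventional file
// extension for a given delimiter. Spark's CSV writer always emits ".csv"
// regardless of the delimiter, and the proposed "fileExtension" option
// would accept any arbitrary string, consistent or not.
def conventionalExtension(delimiter: String): String = delimiter match {
  case "\t" => ".tsv"
  case ","  => ".csv"
  case _    => ".csv" // no standard extension for exotic delimiters
}
```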
[GitHub] spark pull request #17956: [SPARK-18772][SQL] Avoid unnecessary conversion t...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/17956
[GitHub] spark pull request #16199: [SPARK-18772][SQL] NaN/Infinite float parsing in ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/16199
[GitHub] spark issue #17956: [SPARK-18772][SQL] Avoid unnecessary conversion try for ...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/17956

Thanks, merging to master/2.2! I think this change is pretty safe. We can discuss two things later:
1. whether we want to support more special strings like `Inf`
2. whether we want to make the matching case-insensitive
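[Editor's note: the two follow-up questions can be sketched as below. This is an illustration only, not Spark's actual parsing code (which lives in the CSV read path); the function name and shape are made up for this example.]

```scala
// Illustrative only, NOT Spark's implementation. Spark currently accepts
// the exact spellings "NaN", "Infinity" and "-Infinity"; the two open
// questions are (1) extra spellings like "Inf" and (2) case-insensitivity.
def parseSpecialFloat(s: String, caseInsensitive: Boolean): Option[Float] = {
  val key = if (caseInsensitive) s.toLowerCase else s
  // Exact spellings supported today.
  val exact = Map(
    "NaN" -> Float.NaN,
    "Infinity" -> Float.PositiveInfinity,
    "-Infinity" -> Float.NegativeInfinity)
  // Relaxed table: lower-cased keys plus the extra "Inf" spellings.
  val relaxed = exact.map { case (k, v) => k.toLowerCase -> v } ++
    Map("inf" -> Float.PositiveInfinity, "-inf" -> Float.NegativeInfinity)
  if (caseInsensitive) relaxed.get(key) else exact.get(key)
}
```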
[GitHub] spark issue #17298: [SPARK-19094][WIP][PySpark] Plumb through logging for IJ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17298 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76897/ Test FAILed.
[GitHub] spark issue #17298: [SPARK-19094][WIP][PySpark] Plumb through logging for IJ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17298 Merged build finished. Test FAILed.
[GitHub] spark issue #17298: [SPARK-19094][WIP][PySpark] Plumb through logging for IJ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17298 **[Test build #76897 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76897/testReport)** for PR 17298 at commit [`08632fd`](https://github.com/apache/spark/commit/08632fdcee127bf43cf90f44139925e2c26b4946).
* This patch **fails PySpark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/17938 We are going to support bucketing in Hive-style CREATE TABLE syntax soon.
[GitHub] spark issue #17964: [SPARK-20725][SQL] partial aggregate should behave corre...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17964 **[Test build #76901 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76901/testReport)** for PR 17964 at commit [`49da955`](https://github.com/apache/spark/commit/49da955dce260260325708d07becbc692cd3a005).
[GitHub] spark pull request #17964: [SPARK-20725][SQL] partial aggregate should behav...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/17964#discussion_r116357380

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/SameResultSuite.scala ---
@@ -46,4 +48,10 @@ class SameResultSuite extends QueryTest with SharedSQLContext {
     df.queryExecution.sparkPlan.find(_.isInstanceOf[FileSourceScanExec]).get
       .asInstanceOf[FileSourceScanExec]
   }
+
+  test("SPARK-20725: partial aggregate should behave correctly for sameResult") {
+    val df1 = spark.range(10).agg(sum($"id"))
+    val df2 = spark.range(10).agg(sum($"id"))
--- End diff --

Good catch! The reason is that `HashAggregateExec.requiredChildDistributionExpressions` is an `Option[Seq[Expression]]`, which is not treated as part of the expressions of `HashAggregateExec` and is thus not touched by `QueryPlan.mapExpressions`. I have fixed it in `QueryPlan`.
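[Editor's note: the traversal gap described above can be reproduced with a toy model. The names below (`Agg`, `Attr`) are made up for illustration, not Spark code. The point is that a transform which only rewrites the obvious expression-typed fields, as the pre-fix `QueryPlan.mapExpressions` did, never reaches expressions nested inside an `Option[Seq[...]]` field, so canonicalization misses them and `sameResult` reports false for semantically identical plans.]

```scala
// Toy model: Attr stands in for an attribute with a per-plan expression id;
// canonicalization rewrites all ids to 0 so equivalent plans compare equal.
sealed trait Expr
case class Attr(id: Long) extends Expr

// Agg mimics HashAggregateExec: one ordinary Seq[Expr] field and one
// Option[Seq[Expr]] field like requiredChildDistributionExpressions.
case class Agg(groupings: Seq[Expr], requiredDistribution: Option[Seq[Expr]])

// Buggy traversal: only touches the Seq[Expr] field.
def mapExpressionsNaive(plan: Agg, f: Expr => Expr): Agg =
  plan.copy(groupings = plan.groupings.map(f))

// Fixed traversal: also descends into the Option[Seq[Expr]] field.
def mapExpressionsFixed(plan: Agg, f: Expr => Expr): Agg =
  plan.copy(
    groupings = plan.groupings.map(f),
    requiredDistribution = plan.requiredDistribution.map(_.map(f)))

// Canonicalization: normalize every expression id to 0.
val canonicalize: Expr => Expr = { case Attr(_) => Attr(0) }
```

With the naive traversal, two plans that differ only in expression ids still compare unequal after canonicalization, which is exactly the `sameResult` failure the test above exercises.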
[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17938 Merged build finished. Test PASSed.
[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17938 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76900/ Test PASSed.
[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17938 **[Test build #76900 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76900/testReport)** for PR 17938 at commit [`92fb3b3`](https://github.com/apache/spark/commit/92fb3b3e00a666ff3bd1eca4e5dee0cefcca2d55).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...
Github user zero323 commented on the issue: https://github.com/apache/spark/pull/17938 @cloud-fan Thanks for the clarification. Just a thought: shouldn't we either support it consistently or not support it at all? The current behaviour is quite confusing, and I don't think documentation alone will cut it.
[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17938 **[Test build #76900 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76900/testReport)** for PR 17938 at commit [`92fb3b3`](https://github.com/apache/spark/commit/92fb3b3e00a666ff3bd1eca4e5dee0cefcca2d55).
[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17938 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76899/ Test PASSed.
[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17938 Merged build finished. Test PASSed.
[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17938 **[Test build #76899 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76899/testReport)** for PR 17938 at commit [`b5babf6`](https://github.com/apache/spark/commit/b5babf65571661ca45880cd80a950959f66523a1).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #17965: [SPARK-20726][SPARKR] wrapper for SQL broadcast
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17965 Merged build finished. Test PASSed.
[GitHub] spark issue #17965: [SPARK-20726][SPARKR] wrapper for SQL broadcast
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17965 **[Test build #76898 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76898/testReport)** for PR 17965 at commit [`1530785`](https://github.com/apache/spark/commit/1530785f7469830446cd95717d524eb42d88e4ab).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #17965: [SPARK-20726][SPARKR] wrapper for SQL broadcast
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17965 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76898/ Test PASSed.
[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/17938 When you omit `USING`, it is Hive-style CREATE TABLE syntax, which is very different from Spark's. We should encourage users to use the Spark-style CREATE TABLE syntax and only document that form (with the `USING` clause).
[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17938 **[Test build #76899 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76899/testReport)** for PR 17938 at commit [`b5babf6`](https://github.com/apache/spark/commit/b5babf65571661ca45880cd80a950959f66523a1).
[GitHub] spark pull request #17965: [SPARK-20726][SPARKR] wrapper for SQL broadcast
Github user zero323 commented on a diff in the pull request: https://github.com/apache/spark/pull/17965#discussion_r116355839

--- Diff: R/pkg/R/DataFrame.R ---
@@ -3769,3 +3769,33 @@ setMethod("alias",
             sdf <- callJMethod(object@sdf, "alias", data)
             dataFrame(sdf)
           })
+
--- End diff --

Done.
[GitHub] spark issue #17965: [SPARK-20726][SPARKR] wrapper for SQL broadcast
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17965 **[Test build #76898 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76898/testReport)** for PR 17965 at commit [`1530785`](https://github.com/apache/spark/commit/1530785f7469830446cd95717d524eb42d88e4ab).
[GitHub] spark pull request #17965: [SPARK-20726][SPARKR] wrapper for SQL broadcast
Github user zero323 commented on a diff in the pull request: https://github.com/apache/spark/pull/17965#discussion_r116355836

--- Diff: R/pkg/R/generics.R ---
@@ -799,6 +799,10 @@ setGeneric("write.df", function(df, path = NULL, ...) { standardGeneric("write.d
 #' @export
 setGeneric("randomSplit", function(x, weights, seed) { standardGeneric("randomSplit") })
 
+#' @rdname broadcast
+#' @export
+setGeneric("broadcast", function(x) { standardGeneric("broadcast") })
--- End diff --

It doesn't seem to affect the docs, so I don't think we have to touch this for now:

![image](https://cloud.githubusercontent.com/assets/1554276/26024791/88a39940-37d9-11e7-9f11-ac1510b59215.png)
[GitHub] spark issue #17298: [SPARK-19094][WIP][PySpark] Plumb through logging for IJ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17298 **[Test build #76897 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76897/testReport)** for PR 17298 at commit [`08632fd`](https://github.com/apache/spark/commit/08632fdcee127bf43cf90f44139925e2c26b4946).
[GitHub] spark pull request #17969: [SPARK-20729][SPARKR][ML] Reduce boilerplate in S...
Github user zero323 commented on a diff in the pull request: https://github.com/apache/spark/pull/17969#discussion_r116355659 --- Diff: R/pkg/R/mllib_wrapper.R --- @@ -0,0 +1,61 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +#' S4 class that represents a Java ML model +#' +#' @param jobj a Java object reference to the backing Scala model +#' @export +#' @note JavaModel since 2.3.0 +setClass("JavaModel", representation(jobj = "jobj")) + +#' Makes predictions from a Java ML model +#' +#' @param object a Spark ML model. +#' @param newData a SparkDataFrame for testing. +#' @return \code{predict} returns a SparkDataFrame containing predicted value. +#' @rdname spark.predict +#' @aliases predict,JavaModel-method --- End diff -- I am biased here, but I'll argue that it doesn't. Both `predict` and `write.ml` (same as `read.ml`) are extremely generic and in general we don't provide any useful information about these. And the usage is already covered by class `examples`. 
Finally, we can use `@seealso` to provide a bit more R-ish experience if you think it is not enough. Something along the lines of the `lm` docs: ![image](https://cloud.githubusercontent.com/assets/1554276/26024731/2214f012-37d8-11e7-9afb-b750e9c647ff.png) Moreover, using this approach significantly reduces the amount of clutter in the generated docs. They are shorter, and the argument list is focused on the important parts, same as `value`. See for example the GLM docs below. So IMHO this is actually a significant improvement. Personally I would do the same with all the `print`s and `summaries` as well, although it wouldn't reduce the codebase (for now). This would further shorten the docs and remove awkward descriptions like this: ![image](https://cloud.githubusercontent.com/assets/1554276/26024707/567b2020-37d7-11e7-8c21-260404d7767d.png) And from the developer side it is a clear win. No mindless copy / paste / replace cycle, and more time to provide useful examples. __Before__: ![image](https://cloud.githubusercontent.com/assets/1554276/26024648/1c36253c-37d6-11e7-9411-72c0c14c54a8.png) __After__: ![image](https://cloud.githubusercontent.com/assets/1554276/26024653/2643bd64-37d6-11e7-8463-08662611cd37.png)
[GitHub] spark issue #12646: [SPARK-14878][SQL] Trim characters string function suppo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/12646 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76895/ Test PASSed.
[GitHub] spark issue #12646: [SPARK-14878][SQL] Trim characters string function suppo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/12646 Merged build finished. Test PASSed.
[GitHub] spark issue #12646: [SPARK-14878][SQL] Trim characters string function suppo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/12646 **[Test build #76895 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76895/testReport)** for PR 12646 at commit [`63fab9f`](https://github.com/apache/spark/commit/63fab9f87af0a551efe9f9d3872ff17b972ee834). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #17965: [SPARK-20726][SPARKR] wrapper for SQL broadcast
Github user zero323 commented on a diff in the pull request: https://github.com/apache/spark/pull/17965#discussion_r116355102 --- Diff: R/pkg/R/DataFrame.R --- @@ -3769,3 +3769,33 @@ setMethod("alias", sdf <- callJMethod(object@sdf, "alias", data) dataFrame(sdf) }) + + +#' broadcast +#' +#' Return a new SparkDataFrame marked as small enough for use in broadcast joins. +#' +#' Equivalent to hint(x, "broadcast"). --- End diff -- I did double-check this, but for some reason `\code` here made `roxygen` unhappy when I tried it last time.
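For context, the behavior this SparkR wrapper mirrors is already exposed on the Scala side through `org.apache.spark.sql.functions.broadcast` and `Dataset.hint("broadcast")`. A minimal Scala sketch (the data and names here are hypothetical, for illustration only):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("broadcast-hint-sketch").getOrCreate()
import spark.implicits._

// Hypothetical fact and (small) dimension tables.
val orders = Seq((1, "book"), (2, "pen")).toDF("cust_id", "item")
val customers = Seq((1, "Alice"), (2, "Bob")).toDF("cust_id", "name")

// Marking the small side for broadcast; equivalent to customers.hint("broadcast").
val joined = orders.join(broadcast(customers), "cust_id")
joined.explain()  // the plan should prefer a broadcast join over a sort-merge join
```

This requires a running Spark session, so it is a sketch of the API shape rather than a standalone program.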
[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...
Github user zero323 commented on the issue: https://github.com/apache/spark/pull/17938 @gatorsmile Huh... in that case it looks like the parser (?) needs a little bit of work, unless of course the following are intended features.

- Omitting `USING` doesn't work:

  ```sql
  CREATE TABLE user_info_bucketed(user_id BIGINT, firstname STRING, lastname STRING)
  CLUSTERED BY(user_id) INTO 256 BUCKETS
  ```

  fails with:

  ```
  Error in query:
  Operation not allowed: CREATE TABLE ... CLUSTERED BY(line 1, pos 0)

  == SQL ==
  CREATE TABLE user_info_bucketed(user_id BIGINT, firstname STRING, lastname STRING)
  ^^^
  CLUSTERED BY(user_id) INTO 256 BUCKETS
  ```

- Omitting `USING` and adding `PARTITIONED BY` with a column not present in the main column list (valid Hive DDL) doesn't work:

  ```sql
  CREATE TABLE user_info_bucketed(user_id BIGINT, firstname STRING, lastname STRING)
  PARTITIONED BY (department STRING)
  CLUSTERED BY(user_id) INTO 256 BUCKETS
  ```

  fails with:

  ```
  Error in query:
  Operation not allowed: CREATE TABLE ... CLUSTERED BY(line 1, pos 2)

  == SQL ==
  CREATE TABLE user_info_bucketed(user_id BIGINT, firstname STRING, lastname STRING)
  --^^^
  PARTITIONED BY (department STRING)
  CLUSTERED BY(user_id) INTO 256 BUCKETS
  ```

- `PARTITIONED BY` alone works:

  ```sql
  CREATE TABLE user_info_bucketed(user_id BIGINT, firstname STRING, lastname STRING)
  PARTITIONED BY (department STRING)
  ```

- `PARTITIONED BY` with `USING`, when the partition column is in the main spec, works:

  ```sql
  CREATE TABLE user_info_bucketed(
    user_id BIGINT, firstname STRING, lastname STRING, department STRING)
  USING parquet
  PARTITIONED BY (department)
  ```

- `CLUSTERED BY` + `PARTITIONED BY` with `USING`, when the partition column is in the main spec, works:

  ```sql
  CREATE TABLE user_info_bucketed(
    user_id BIGINT, firstname STRING, lastname STRING, department STRING)
  USING parquet
  PARTITIONED BY (department)
  CLUSTERED BY(user_id) INTO 256 BUCKETS
  ```

- `PARTITIONED BY` when the partition column is in the main spec but `USING` is omitted:

  ```sql
  CREATE TABLE user_info_bucketed(
    user_id BIGINT, firstname STRING,
    lastname STRING, department STRING)
  PARTITIONED BY (department)
  ```

  fails with:

  ```
  Error in query:
  mismatched input ')' expecting {'SELECT', 'FROM', 'ADD', 'AS', 'ALL', 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 'ROLLUP', 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 'EXISTS', 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 'ASC', 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 'JOIN', 'CROSS', 'OUTER', 'INNER', 'LEFT', 'SEMI', 'RIGHT', 'FULL', 'NATURAL', 'ON', 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 'UNBOUNDED', 'PRECEDING', 'FOLLOWING', 'CURRENT', 'FIRST', 'AFTER', 'LAST', 'ROW', 'WITH', 'VALUES', 'CREATE', 'TABLE', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', 'EXPLAIN', 'FORMAT', 'LOGICAL', 'CODEGEN', 'COST', 'CAST', 'SHOW', 'TABLES', 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', 'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', 'RESET', 'DATA', 'START', 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'MACRO', 'IGNORE', 'IF', 'DIV', 'PERCENT', 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', 'TRANSFORM', 'REDUCE', 'USING', 'SERDE', 'SERDEPROPERTIES', 'RECORDREADER', 'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', 'COLLECTION', 'ITEMS', 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'FUNCTION', 'EXTENDED', 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 'FORMATTED', 'GLOBAL', TEMPORARY, 'OPTIONS', 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 'STORED', 'DIRECTORIES', 'LOCATION', 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', 'TOUCH', 'COMPACT', 'CONCATENATE', 'CHANGE', 'CASCADE', 'RESTRICT', 'CLUSTERED', 'SORTED', 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, DATABASES, 'DFS', 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', 'PARTITIONED', 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 'MSCK', 'REPAIR', 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', 'COMPACTIONS', 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', 'OPTION', 'ANTI', 'LOCAL', 'INPATH', 'CURRENT_DATE', 'CURRENT_TIMESTAMP', IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 3, pos 30)

  == SQL ==
  CREATE TABLE user_info_bucketed(
    user_id BIGINT, firstname STRING,
    lastname STRING, department STRING)
  PARTITIONED BY
  ```
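For comparison, the bucketed-and-partitioned layout produced by the working `USING` variants above can also be created through the `DataFrameWriter` API. A minimal Scala sketch (the DataFrame contents and table name are hypothetical; `bucketBy`/`saveAsTable` require a metastore-backed table, so this needs a running Spark session with Hive support):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("bucketing-sketch")
  .enableHiveSupport()
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data matching the column list from the DDL above.
val userInfo = Seq((1L, "Jane", "Doe", "eng"))
  .toDF("user_id", "firstname", "lastname", "department")

// Mirrors: USING parquet PARTITIONED BY (department)
//          CLUSTERED BY(user_id) INTO 256 BUCKETS
userInfo.write
  .format("parquet")
  .partitionBy("department")
  .bucketBy(256, "user_id")
  .saveAsTable("user_info_bucketed")
```

Note that `bucketBy` only works with `saveAsTable`, not with a plain path-based `save()`, which is consistent with bucketing metadata living in the catalog.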
[GitHub] spark issue #17973: [SPARK-20731][SQL] Add ability to change or omit .csv fi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17973 Merged build finished. Test PASSed.
[GitHub] spark issue #17973: [SPARK-20731][SQL] Add ability to change or omit .csv fi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17973 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76896/ Test PASSed.
[GitHub] spark issue #17964: [SPARK-20725][SQL] partial aggregate should behave corre...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17964 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76893/ Test PASSed.
[GitHub] spark issue #17964: [SPARK-20725][SQL] partial aggregate should behave corre...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17964 Merged build finished. Test PASSed.
[GitHub] spark issue #17964: [SPARK-20725][SQL] partial aggregate should behave corre...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17964 **[Test build #76893 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76893/testReport)** for PR 17964 at commit [`557298e`](https://github.com/apache/spark/commit/557298e3d88c04910ebff9cdb1ae77a1537c83af). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.