[GitHub] spark pull request #17973: [SPARK-20731][SQL] Add ability to change or omit ...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/17973#discussion_r116372498

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
@@ -622,6 +622,31 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
   }
 }

+  test("save tsv with tsv suffix") {
+    withTempDir { dir =>
+      val csvDir = new File(dir, "csv").getCanonicalPath
+      val cars = spark.read
+        .format("csv")
+        .option("header", "true")
+        .load(testFile(carsFile))
+
+      cars.coalesce(1).write
+        .option("header", "true")
+        .option("fileExtension", ".tsv")
+        .option("delimiter", "\t")
--- End diff --

Also, what is your usage scenario? It sounds like you want to omit the extension?

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
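For readers following along, the change under review can be sketched as below. This is illustrative only: `delimiter` is an existing CSV data source option, while `fileExtension` is the new option this PR proposes and is not part of released Spark; the input path and `SparkSession` setup are assumptions.

```scala
import org.apache.spark.sql.SparkSession

object TsvWriteSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("tsv-write-sketch")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical input file; any CSV with a header row works.
    val df = spark.read
      .format("csv")
      .option("header", "true")
      .load("cars.csv")

    // Write tab-delimited output. "fileExtension" is the option
    // proposed by SPARK-20731 and does not exist in released Spark;
    // without it the files are still written with a .csv suffix.
    df.coalesce(1).write
      .format("csv")
      .option("header", "true")
      .option("delimiter", "\t")
      .option("fileExtension", ".tsv") // proposed option, hypothetical
      .save("out/cars-tsv")

    spark.stop()
  }
}
```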
[GitHub] spark issue #17953: [SPARK-20680][SQL] Spark-sql do not support for void col...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/17953

Is your test scenario something like this?

```scala
withTable("t", "tabNullType") {
  val client = spark.sharedState.externalCatalog.asInstanceOf[HiveExternalCatalog].client
  client.runSqlHive("CREATE TABLE t (t1 int)")
  client.runSqlHive("INSERT INTO t VALUES (3)")
  client.runSqlHive("CREATE TABLE tabNullType AS SELECT NULL AS col FROM t")
  spark.table("tabNullType").show()
  spark.table("tabNullType").printSchema()
}
```

Is this what you want?
[GitHub] spark issue #17975: [SPARK-20725][SQL][BRANCH-2.1] partial aggregate should ...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/17975

It seems there is a bug if we backport this without #17541. cc @hvanhovell, shall we also backport #17541? Or leave branch-2.1 alone, as this is not a critical bug?
[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/17938

LGTM except a few minor comments. cc @tejasapatil @cloud-fan
[GitHub] spark pull request #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/17938#discussion_r116371744 --- Diff: docs/sql-programming-guide.md --- @@ -581,6 +581,113 @@ Starting from Spark 2.1, persistent datasource tables have per-partition metadat Note that partition information is not gathered by default when creating external datasource tables (those with a `path` option). To sync the partition information in the metastore, you can invoke `MSCK REPAIR TABLE`. +### Bucketing, Sorting and Partitioning + +For file-based data source it is also possible to bucket and sort or partition the output. +Bucketing and sorting is applicable only to persistent tables: + + + + +{% include_example write_sorting_and_bucketing scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %} + + + +{% include_example write_sorting_and_bucketing java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %} + + + +{% include_example write_sorting_and_bucketing python/sql/datasource.py %} + + + + +{% highlight sql %} + +CREATE TABLE users_bucketed_by_name( + name STRING, + favorite_color STRING, + favorite_NUMBERS array +) USING parquet +CLUSTERED BY(name) INTO 42 BUCKETS; + +{% endhighlight %} + + + + + +while partitioning can be used with both `save` and `saveAsTable`: + + + + + +{% include_example write_partitioning scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %} + + + +{% include_example write_partitioning java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %} + + + +{% include_example write_partitioning python/sql/datasource.py %} + + + + +{% highlight sql %} + +CREATE TABLE users_by_favorite_color( + name STRING, + favorite_color STRING, + favorite_NUMBERS array +) USING csv PARTITIONED BY(favorite_color); + +{% endhighlight %} + + + + + +It is possible to use both partitions and buckets for a single table: + + + + +{% include_example write_partition_and_bucket 
scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %} + + + +{% include_example write_partition_and_bucket java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %} + + + +{% include_example write_partition_and_bucket python/sql/datasource.py %} + + + + +{% highlight sql %} + +CREATE TABLE users_bucketed_and_partitioned( + name STRING, + favorite_color STRING, + favorite_NUMBERS array +) USING parquet +PARTITIONED BY (favorite_color) +CLUSTERED BY(name) INTO 42 BUCKETS; + +{% endhighlight %} + + + + + +`partitionBy` creates a directory structure as described in the [Partition Discovery](#partition-discovery) section. +Because of that it has limited applicability to columns with high cardinality. In contrast `bucketBy` distributes +data across fixed number of buckets and can be used if a number of unique values is unbounded. --- End diff -- `used if` -> `used when `
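The `DataFrameWriter` calls the quoted doc section describes can be illustrated with a short Scala sketch. This assumes `peopleDF` and `usersDF` are existing DataFrames with the named columns; `bucketBy`, `sortBy`, and `partitionBy` are the actual Spark 2.x APIs.

```scala
// Bucketing and sorting apply only to persistent tables (saveAsTable):
peopleDF.write
  .bucketBy(42, "name")
  .sortBy("name")
  .saveAsTable("people_bucketed")

// Partitioning works with both save and saveAsTable:
usersDF.write
  .partitionBy("favorite_color")
  .format("parquet")
  .save("namesPartByColor.parquet")

// Partitioning and bucketing can be combined for a single persistent table:
usersDF.write
  .partitionBy("favorite_color")
  .bucketBy(42, "name")
  .saveAsTable("users_partitioned_bucketed")
```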
[GitHub] spark pull request #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/17938#discussion_r116371733 --- Diff: docs/sql-programming-guide.md --- @@ -581,6 +581,113 @@ Starting from Spark 2.1, persistent datasource tables have per-partition metadat Note that partition information is not gathered by default when creating external datasource tables (those with a `path` option). To sync the partition information in the metastore, you can invoke `MSCK REPAIR TABLE`. +### Bucketing, Sorting and Partitioning + +For file-based data source it is also possible to bucket and sort or partition the output. +Bucketing and sorting is applicable only to persistent tables: + + + + +{% include_example write_sorting_and_bucketing scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %} + + + +{% include_example write_sorting_and_bucketing java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %} + + + +{% include_example write_sorting_and_bucketing python/sql/datasource.py %} + + + + +{% highlight sql %} + +CREATE TABLE users_bucketed_by_name( + name STRING, + favorite_color STRING, + favorite_NUMBERS array +) USING parquet +CLUSTERED BY(name) INTO 42 BUCKETS; + +{% endhighlight %} + + + + + +while partitioning can be used with both `save` and `saveAsTable`: + + + + + +{% include_example write_partitioning scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %} + + + +{% include_example write_partitioning java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %} + + + +{% include_example write_partitioning python/sql/datasource.py %} + + + + +{% highlight sql %} + +CREATE TABLE users_by_favorite_color( + name STRING, + favorite_color STRING, + favorite_NUMBERS array +) USING csv PARTITIONED BY(favorite_color); + +{% endhighlight %} + + + + + +It is possible to use both partitions and buckets for a single table: + + + + +{% include_example write_partition_and_bucket 
scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %} + + + +{% include_example write_partition_and_bucket java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %} + + + +{% include_example write_partition_and_bucket python/sql/datasource.py %} + + + + +{% highlight sql %} + +CREATE TABLE users_bucketed_and_partitioned( + name STRING, + favorite_color STRING, + favorite_NUMBERS array +) USING parquet +PARTITIONED BY (favorite_color) +CLUSTERED BY(name) INTO 42 BUCKETS; + +{% endhighlight %} + + + + + +`partitionBy` creates a directory structure as described in the [Partition Discovery](#partition-discovery) section. +Because of that it has limited applicability to columns with high cardinality. In contrast `bucketBy` distributes --- End diff -- `In contrast `bucketBy` distributes` -> `In contrast, `bucketBy` distributes`
[GitHub] spark pull request #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/17938#discussion_r116371727 --- Diff: docs/sql-programming-guide.md --- @@ -581,6 +581,113 @@ Starting from Spark 2.1, persistent datasource tables have per-partition metadat Note that partition information is not gathered by default when creating external datasource tables (those with a `path` option). To sync the partition information in the metastore, you can invoke `MSCK REPAIR TABLE`. +### Bucketing, Sorting and Partitioning + +For file-based data source it is also possible to bucket and sort or partition the output. +Bucketing and sorting is applicable only to persistent tables: + + + + +{% include_example write_sorting_and_bucketing scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %} + + + +{% include_example write_sorting_and_bucketing java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %} + + + +{% include_example write_sorting_and_bucketing python/sql/datasource.py %} + + + + +{% highlight sql %} + +CREATE TABLE users_bucketed_by_name( + name STRING, + favorite_color STRING, + favorite_NUMBERS array +) USING parquet +CLUSTERED BY(name) INTO 42 BUCKETS; + +{% endhighlight %} + + + + + +while partitioning can be used with both `save` and `saveAsTable`: + + + + + +{% include_example write_partitioning scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %} + + + +{% include_example write_partitioning java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %} + + + +{% include_example write_partitioning python/sql/datasource.py %} + + + + +{% highlight sql %} + +CREATE TABLE users_by_favorite_color( + name STRING, + favorite_color STRING, + favorite_NUMBERS array +) USING csv PARTITIONED BY(favorite_color); + +{% endhighlight %} + + + + + +It is possible to use both partitions and buckets for a single table: + + + + +{% include_example write_partition_and_bucket 
scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %} + + + +{% include_example write_partition_and_bucket java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %} + + + +{% include_example write_partition_and_bucket python/sql/datasource.py %} + + + + +{% highlight sql %} + +CREATE TABLE users_bucketed_and_partitioned( + name STRING, + favorite_color STRING, + favorite_NUMBERS array +) USING parquet +PARTITIONED BY (favorite_color) +CLUSTERED BY(name) INTO 42 BUCKETS; + +{% endhighlight %} + + + + + +`partitionBy` creates a directory structure as described in the [Partition Discovery](#partition-discovery) section. +Because of that it has limited applicability to columns with high cardinality. In contrast `bucketBy` distributes --- End diff -- `Because of that it has` -> `Thus, it has`
[GitHub] spark pull request #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/17938#discussion_r116371680 --- Diff: docs/sql-programming-guide.md --- @@ -581,6 +581,113 @@ Starting from Spark 2.1, persistent datasource tables have per-partition metadat Note that partition information is not gathered by default when creating external datasource tables (those with a `path` option). To sync the partition information in the metastore, you can invoke `MSCK REPAIR TABLE`. +### Bucketing, Sorting and Partitioning + +For file-based data source it is also possible to bucket and sort or partition the output. +Bucketing and sorting is applicable only to persistent tables: + + + + +{% include_example write_sorting_and_bucketing scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %} + + + +{% include_example write_sorting_and_bucketing java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %} + + + +{% include_example write_sorting_and_bucketing python/sql/datasource.py %} + + + + +{% highlight sql %} + +CREATE TABLE users_bucketed_by_name( + name STRING, + favorite_color STRING, + favorite_NUMBERS array +) USING parquet +CLUSTERED BY(name) INTO 42 BUCKETS; + +{% endhighlight %} + + + + + +while partitioning can be used with both `save` and `saveAsTable`: --- End diff -- Nit: ``` both `save` and `saveAsTable` ``` -> ``` both `save` and `saveAsTable` when using the Dataset APIs. ```
[GitHub] spark pull request #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/17938#discussion_r116371649 --- Diff: docs/sql-programming-guide.md --- @@ -581,6 +581,113 @@ Starting from Spark 2.1, persistent datasource tables have per-partition metadat Note that partition information is not gathered by default when creating external datasource tables (those with a `path` option). To sync the partition information in the metastore, you can invoke `MSCK REPAIR TABLE`. +### Bucketing, Sorting and Partitioning + +For file-based data source it is also possible to bucket and sort or partition the output. +Bucketing and sorting is applicable only to persistent tables: + + + + +{% include_example write_sorting_and_bucketing scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %} + + + +{% include_example write_sorting_and_bucketing java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %} + + + +{% include_example write_sorting_and_bucketing python/sql/datasource.py %} + + + + +{% highlight sql %} + +CREATE TABLE users_bucketed_by_name( + name STRING, + favorite_color STRING, + favorite_NUMBERS array +) USING parquet +CLUSTERED BY(name) INTO 42 BUCKETS; + +{% endhighlight %} + + + + + +while partitioning can be used with both `save` and `saveAsTable`: + + + + + +{% include_example write_partitioning scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %} + + + +{% include_example write_partitioning java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %} + + + +{% include_example write_partitioning python/sql/datasource.py %} + + + + +{% highlight sql %} + +CREATE TABLE users_by_favorite_color( + name STRING, + favorite_color STRING, + favorite_NUMBERS array +) USING csv PARTITIONED BY(favorite_color); + +{% endhighlight %} + + + + + +It is possible to use both partitions and buckets for a single table: --- End diff -- `partitions and buckets` -> `partitioning and bucketing`
[GitHub] spark issue #17298: [SPARK-19094][WIP][PySpark] Plumb through logging for IJ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17298

Merged build finished. Test FAILed.
[GitHub] spark issue #17298: [SPARK-19094][WIP][PySpark] Plumb through logging for IJ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17298

Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76903/ Test FAILed.
[GitHub] spark issue #17298: [SPARK-19094][WIP][PySpark] Plumb through logging for IJ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17298

**[Test build #76903 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76903/testReport)** for PR 17298 at commit [`89cf739`](https://github.com/apache/spark/commit/89cf7394527b654c0a079244fe88378278f70e7a).

* This patch **fails PySpark unit tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `case class ParseToTimestamp(left: Expression, format: Option[Expression], child: Expression)`
  * `class AstBuilder(conf: SQLConf) extends SqlBaseBaseVisitor[AnyRef] with Logging `
  * `class CatalystSqlParser(conf: SQLConf) extends AbstractSqlParser `
  * `case class ColumnStatsMap(originalMap: AttributeMap[ColumnStat]) `
  * `trait DataSourceScanExec extends LeafExecNode with CodegenSupport with PredicateHelper `
  * `class SparkSqlAstBuilder(conf: SQLConf) extends AstBuilder(conf) `
  * ` s\"($`
[GitHub] spark pull request #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/17938#discussion_r116371627 --- Diff: docs/sql-programming-guide.md --- @@ -581,6 +581,113 @@ Starting from Spark 2.1, persistent datasource tables have per-partition metadat Note that partition information is not gathered by default when creating external datasource tables (those with a `path` option). To sync the partition information in the metastore, you can invoke `MSCK REPAIR TABLE`. +### Bucketing, Sorting and Partitioning + +For file-based data source it is also possible to bucket and sort or partition the output. --- End diff -- Nit, `For file-based data source it` -> `For file-based data source, it`
[GitHub] spark pull request #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/17938#discussion_r116371632 --- Diff: docs/sql-programming-guide.md --- @@ -581,6 +581,113 @@ Starting from Spark 2.1, persistent datasource tables have per-partition metadat Note that partition information is not gathered by default when creating external datasource tables (those with a `path` option). To sync the partition information in the metastore, you can invoke `MSCK REPAIR TABLE`. +### Bucketing, Sorting and Partitioning + +For file-based data source it is also possible to bucket and sort or partition the output. +Bucketing and sorting is applicable only to persistent tables: --- End diff -- `is applicable` -> `are applicable`
[GitHub] spark pull request #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/17938#discussion_r116371615 --- Diff: docs/sql-programming-guide.md --- @@ -581,6 +581,113 @@ Starting from Spark 2.1, persistent datasource tables have per-partition metadat Note that partition information is not gathered by default when creating external datasource tables (those with a `path` option). To sync the partition information in the metastore, you can invoke `MSCK REPAIR TABLE`. +### Bucketing, Sorting and Partitioning + +For file-based data source it is also possible to bucket and sort or partition the output. +Bucketing and sorting is applicable only to persistent tables: + + + + +{% include_example write_sorting_and_bucketing scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %} + + + +{% include_example write_sorting_and_bucketing java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %} + + + +{% include_example write_sorting_and_bucketing python/sql/datasource.py %} + + + + +{% highlight sql %} + +CREATE TABLE users_bucketed_by_name( + name STRING, + favorite_color STRING, + favorite_NUMBERS array +) USING parquet +CLUSTERED BY(name) INTO 42 BUCKETS; --- End diff -- Could you please use the same table names `people_bucketed` with the same column names in the example? Thanks!
[GitHub] spark pull request #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/17938#discussion_r116371598 --- Diff: docs/sql-programming-guide.md --- @@ -581,6 +581,113 @@ Starting from Spark 2.1, persistent datasource tables have per-partition metadat Note that partition information is not gathered by default when creating external datasource tables (those with a `path` option). To sync the partition information in the metastore, you can invoke `MSCK REPAIR TABLE`. +### Bucketing, Sorting and Partitioning + +For file-based data source it is also possible to bucket and sort or partition the output. +Bucketing and sorting is applicable only to persistent tables: + + + + +{% include_example write_sorting_and_bucketing scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %} + + + +{% include_example write_sorting_and_bucketing java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %} + + + +{% include_example write_sorting_and_bucketing python/sql/datasource.py %} + + + + +{% highlight sql %} + +CREATE TABLE users_bucketed_by_name( + name STRING, + favorite_color STRING, + favorite_NUMBERS array +) USING parquet +CLUSTERED BY(name) INTO 42 BUCKETS; --- End diff -- To be consistent with the example in the other APIs, it is missing the `SORTED BY` clause.
[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/17938

In the current 2.2 docs, we already updated all the syntax to `CREATE TABLE ... USING ...`. This is a new change delivered in 2.2, so it is OK to document it the way you just committed. Let me review it carefully now. Thanks for your work!
[GitHub] spark issue #17975: [SPARK-20725][SQL][BRANCH-2.1] partial aggregate should ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/17975

We did not backport https://github.com/apache/spark/pull/17541 to 2.1. Is it still OK to backport this?
[GitHub] spark issue #15435: [SPARK-17139][ML] Add model summary for MultinomialLogis...
Github user sethah commented on the issue: https://github.com/apache/spark/pull/15435

ping! @jkbradley @yanboliang
[GitHub] spark issue #17975: [SPARK-20725][SQL][BRANCH-2.1] partial aggregate should ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17975

Merged build finished. Test FAILed.
[GitHub] spark issue #17975: [SPARK-20725][SQL][BRANCH-2.1] partial aggregate should ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17975

Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76904/ Test FAILed.
[GitHub] spark issue #17975: [SPARK-20725][SQL][BRANCH-2.1] partial aggregate should ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17975

**[Test build #76904 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76904/testReport)** for PR 17975 at commit [`c4d1679`](https://github.com/apache/spark/commit/c4d16796f3ecec259e3e1af4afa9c13b4c5a142b).

* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #17975: [SPARK-20725][SQL][BRANCH-2.1] partial aggregate should ...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/17975 cc @hvanhovell
[GitHub] spark issue #17975: [SPARK-20725][SQL][BRANCH-2.1] partial aggregate should ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17975 **[Test build #76904 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76904/testReport)** for PR 17975 at commit [`c4d1679`](https://github.com/apache/spark/commit/c4d16796f3ecec259e3e1af4afa9c13b4c5a142b).
[GitHub] spark pull request #17975: [SPARK-20725][SQL][BRANCH-2.1] partial aggregate ...
GitHub user cloud-fan opened a pull request: https://github.com/apache/spark/pull/17975

[SPARK-20725][SQL][BRANCH-2.1] partial aggregate should behave correctly for sameResult

## What changes were proposed in this pull request?

this backports https://github.com/apache/spark/pull/17964 to 2.1

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloud-fan/spark tmp

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17975.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #17975

commit c4d16796f3ecec259e3e1af4afa9c13b4c5a142b
Author: Wenchen Fan
Date: 2017-05-14T01:24:05Z

    partial aggregate should behave correctly for sameResult
[GitHub] spark issue #17298: [SPARK-19094][WIP][PySpark] Plumb through logging for IJ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17298 **[Test build #76903 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76903/testReport)** for PR 17298 at commit [`89cf739`](https://github.com/apache/spark/commit/89cf7394527b654c0a079244fe88378278f70e7a).
[GitHub] spark pull request #17973: [SPARK-20731][SQL] Add ability to change or omit ...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/17973#discussion_r116370085

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
@@ -622,6 +622,31 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
     }
   }

+  test("save tsv with tsv suffix") {
+    withTempDir { dir =>
+      val csvDir = new File(dir, "csv").getCanonicalPath
+      val cars = spark.read
+        .format("csv")
+        .option("header", "true")
+        .load(testFile(carsFile))
+
+      cars.coalesce(1).write
+        .option("header", "true")
+        .option("fileExtension", ".tsv")
+        .option("delimiter", "\t")
--- End diff --

What is the reason why Hive introduced the conf `hive.output.file.extension`?
[GitHub] spark pull request #17974: Eagle beta
Github user jashwantraj92 closed the pull request at: https://github.com/apache/spark/pull/17974
[GitHub] spark pull request #17974: Eagle beta
GitHub user jashwantraj92 opened a pull request: https://github.com/apache/spark/pull/17974

Eagle beta

## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/epfl-labos/spark eagle-beta

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17974.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #17974

commit 748b05302937cbc26aad1b3db61c42c4d0bbe063
Author: Pamela
Date: 2016-04-18T15:46:41Z

    Hawk/Eagle-beta plugin

commit c237870c94e495e19da0cf0d47f23c0197a754c2
Author: Pamela
Date: 2016-04-19T15:53:56Z

    Deleted Sparrow dependency
[GitHub] spark pull request #17965: [SPARK-20726][SPARKR] wrapper for SQL broadcast
Github user zero323 commented on a diff in the pull request: https://github.com/apache/spark/pull/17965#discussion_r116366145

--- Diff: R/pkg/R/generics.R ---
@@ -799,6 +799,10 @@ setGeneric("write.df", function(df, path = NULL, ...) { standardGeneric("write.d
 #' @export
 setGeneric("randomSplit", function(x, weights, seed) { standardGeneric("randomSplit") })

+#' @rdname broadcast
+#' @export
+setGeneric("broadcast", function(x) { standardGeneric("broadcast") })
--- End diff --

> this list is sorted alphabetically within this section

Looks like it used to be at some point, but these days are long gone. I can reorder it right now, but this means rearranging a whole section.
[GitHub] spark pull request #17964: [SPARK-20725][SQL] partial aggregate should behav...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/17964
[GitHub] spark issue #17964: [SPARK-20725][SQL] partial aggregate should behave corre...
Github user hvanhovell commented on the issue: https://github.com/apache/spark/pull/17964 @cloud-fan can you backport this to 2.1?
[GitHub] spark issue #17964: [SPARK-20725][SQL] partial aggregate should behave corre...
Github user hvanhovell commented on the issue: https://github.com/apache/spark/pull/17964 LGTM - merging to master/2.2/2.1
[GitHub] spark issue #17084: [SPARK-18693][ML][MLLIB] ML Evaluators should use weight...
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17084 @imatiach-msft Thanks for the PR. Added a couple of comments. Sorry for the delayed review.
[GitHub] spark pull request #17084: [SPARK-18693][ML][MLLIB] ML Evaluators should use...
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/17084#discussion_r116364047

--- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/BinaryClassificationEvaluator.scala ---
@@ -77,12 +87,16 @@ class BinaryClassificationEvaluator @Since("1.4.0") (@Since("1.4.0") override va
     SchemaUtils.checkNumericType(schema, $(labelCol))
     // TODO: When dataset metadata has been implemented, check rawPredictionCol vector length = 2.
-    val scoreAndLabels =
-      dataset.select(col($(rawPredictionCol)), col($(labelCol)).cast(DoubleType)).rdd.map {
-        case Row(rawPrediction: Vector, label: Double) => (rawPrediction(1), label)
-        case Row(rawPrediction: Double, label: Double) => (rawPrediction, label)
+    val scoreAndLabelsWithWeights =
+      dataset.select(col($(rawPredictionCol)), col($(labelCol)).cast(DoubleType),
+        if (!isDefined(weightCol) || $(weightCol).isEmpty) lit(1.0) else col($(weightCol)))
--- End diff --

Check weightCol is double?
[GitHub] spark pull request #17084: [SPARK-18693][ML][MLLIB] ML Evaluators should use...
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/17084#discussion_r116364179

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/binary/BinaryConfusionMatrix.scala ---
@@ -22,22 +22,22 @@ package org.apache.spark.mllib.evaluation.binary
  */
 private[evaluation] trait BinaryConfusionMatrix {
   /** number of true positives */
-  def numTruePositives: Long
+  def numTruePositives: Double
--- End diff --

I feel it may be better to create new attributes like `weightedTruePositives`
[GitHub] spark pull request #17084: [SPARK-18693][ML][MLLIB] ML Evaluators should use...
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/17084#discussion_r116364061

--- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/BinaryClassificationEvaluator.scala ---
@@ -36,12 +36,18 @@ import org.apache.spark.sql.types.DoubleType
 @Since("1.2.0")
 @Experimental
 class BinaryClassificationEvaluator @Since("1.4.0") (@Since("1.4.0") override val uid: String)
-  extends Evaluator with HasRawPredictionCol with HasLabelCol with DefaultParamsWritable {
+  extends Evaluator with HasRawPredictionCol with HasLabelCol
+    with HasWeightCol with DefaultParamsWritable {

   @Since("1.2.0")
   def this() = this(Identifiable.randomUID("binEval"))

   /**
+   * Default number of bins to use for binary classification evaluation.
+   */
+  val defaultNumberOfBins = 1000
--- End diff --

Why 1000?
[GitHub] spark pull request #17084: [SPARK-18693][ML][MLLIB] ML Evaluators should use...
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/17084#discussion_r116364140

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala ---
@@ -41,13 +41,27 @@ import org.apache.spark.sql.DataFrame
  *          partition boundaries.
  */
 @Since("1.0.0")
-class BinaryClassificationMetrics @Since("1.3.0") (
-    @Since("1.3.0") val scoreAndLabels: RDD[(Double, Double)],
-    @Since("1.3.0") val numBins: Int) extends Logging {
+class BinaryClassificationMetrics @Since("2.2.0") (
+    val numBins: Int,
+    @Since("2.2.0") val scoreAndLabelsWithWeights: RDD[(Double, (Double, Double))])
+  extends Logging {

   require(numBins >= 0, "numBins must be nonnegative")

   /**
+   * Retrieves the score and labels (for binary compatibility).
+   * @return The score and labels.
+   */
+  @Since("1.0.0")
--- End diff --

inconsistent annotation. was 1.3.0
[GitHub] spark pull request #17084: [SPARK-18693][ML][MLLIB] ML Evaluators should use...
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/17084#discussion_r116364224

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala ---
@@ -146,11 +160,13 @@ class BinaryClassificationMetrics @Since("1.3.0") (
   private lazy val (
     cumulativeCounts: RDD[(Double, BinaryLabelCounter)],
     confusions: RDD[(Double, BinaryConfusionMatrix)]) = {
-    // Create a bin for each distinct score value, count positives and negatives within each bin,
-    // and then sort by score values in descending order.
-    val counts = scoreAndLabels.combineByKey(
-      createCombiner = (label: Double) => new BinaryLabelCounter(0L, 0L) += label,
-      mergeValue = (c: BinaryLabelCounter, label: Double) => c += label,
+    // Create a bin for each distinct score value, count weighted positives and
+    // negatives within each bin, and then sort by score values in descending order.
+    val counts = scoreAndLabelsWithWeights.combineByKey(
+      createCombiner = (labelAndWeight: (Double, Double)) =>
+        new BinaryLabelCounter(0L, 0L) += (labelAndWeight._1, labelAndWeight._2),
--- End diff --

`new BinaryLabelCounter(0.0, 0.0)`? Defined to take double parameters below.
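For readers following the review above: the point is that once per-bin counts accumulate example weights rather than long counts, the counter's fields must be `Double`. A minimal standalone sketch of that idea (a hypothetical `WeightedLabelCounter`, not Spark's actual `BinaryLabelCounter`):

```scala
// Hypothetical standalone sketch, not Spark's BinaryLabelCounter:
// a per-bin counter that accumulates example weights rather than
// long counts, which is why Double constructor arguments make sense.
class WeightedLabelCounter(
    var weightedPositives: Double = 0.0,
    var weightedNegatives: Double = 0.0) {

  // Add one (label, weight) pair; labels are assumed to be 0.0 or 1.0.
  def add(label: Double, weight: Double): this.type = {
    if (label > 0.5) weightedPositives += weight
    else weightedNegatives += weight
    this
  }
}
```

With weights of 1.0 for every example this degenerates to the unweighted counts, which is why the weighted API can stay backward compatible.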
[GitHub] spark issue #17970: [SPARK-20730][SQL] Add an optimizer rule to combine nest...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17970 Merged build finished. Test PASSed.
[GitHub] spark issue #17970: [SPARK-20730][SQL] Add an optimizer rule to combine nest...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17970 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76902/ Test PASSed.
[GitHub] spark issue #17970: [SPARK-20730][SQL] Add an optimizer rule to combine nest...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17970

**[Test build #76902 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76902/testReport)** for PR 17970 at commit [`4a98693`](https://github.com/apache/spark/commit/4a9869327311a073b7c6e2197605f8422c2154ba).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #17970: [SPARK-20730][SQL] Add an optimizer rule to combine nest...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17970 **[Test build #76902 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76902/testReport)** for PR 17970 at commit [`4a98693`](https://github.com/apache/spark/commit/4a9869327311a073b7c6e2197605f8422c2154ba).
[GitHub] spark pull request #17970: [SPARK-20730][SQL] Add an optimizer rule to combi...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/17970#discussion_r116360222

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -111,7 +111,8 @@ abstract class Optimizer(sessionCatalog: SessionCatalog, conf: SQLConf)
       RemoveRedundantProject,
       SimplifyCreateStructOps,
       SimplifyCreateArrayOps,
-      SimplifyCreateMapOps) ++
+      SimplifyCreateMapOps,
+      CombineConcat) ++
--- End diff --

Thanks for the comments! Fixed.
[GitHub] spark issue #17970: [SPARK-20730][SQL] Add an optimizer rule to combine nest...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/17970 +1, LGTM except one minor naming comment.
[GitHub] spark issue #17964: [SPARK-20725][SQL] partial aggregate should behave corre...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17964 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76901/ Test PASSed.
[GitHub] spark issue #17964: [SPARK-20725][SQL] partial aggregate should behave corre...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17964 Merged build finished. Test PASSed.
[GitHub] spark pull request #17644: [SPARK-17729] [SQL] Enable creating hive bucketed...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/17644#discussion_r116359891

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala ---
@@ -307,6 +307,27 @@ case class InsertIntoHiveTable(
     }
   }

+    table.bucketSpec match {
+      case Some(bucketSpec) =>
+        // Writes to bucketed hive tables are allowed only if user does not care about maintaining
+        // table's bucketing ie. both "hive.enforce.bucketing" and "hive.enforce.sorting" are
+        // set to false
+        val enforceBucketingConfig = "hive.enforce.bucketing"
+        val enforceSortingConfig = "hive.enforce.sorting"
+
+        val message = s"Output Hive table ${table.identifier} is bucketed but Spark" +
+          "currently does NOT populate bucketed output which is compatible with Hive."
+
+        if (hadoopConf.get(enforceBucketingConfig, "true").toBoolean ||
+            hadoopConf.get(enforceSortingConfig, "true").toBoolean) {
+          throw new AnalysisException(message)
+        } else {
+          logWarning(message + s" Inserting data anyways since both $enforceBucketingConfig and " +
+            s"$enforceSortingConfig are set to false.")
--- End diff --

shall we remove the bucket properties of the table in this case? what does hive do?
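The enforcement check quoted in that diff reduces to a small predicate. A sketch with a plain `Map` standing in for `hadoopConf` (a hypothetical helper written for illustration, not Spark code):

```scala
// Sketch of the enforcement logic under review: writes to a bucketed
// Hive table proceed only when BOTH flags are explicitly set to false;
// either flag defaulting to "true" blocks the write.
def bucketingEnforced(conf: Map[String, String]): Boolean =
  conf.getOrElse("hive.enforce.bucketing", "true").toBoolean ||
  conf.getOrElse("hive.enforce.sorting", "true").toBoolean
```

The defaults matter here: with an empty configuration the predicate is true, so the safe behavior (rejecting the write) is the default.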
[GitHub] spark pull request #17970: [SPARK-20730][SQL] Add an optimizer rule to combi...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/17970#discussion_r116359887

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -111,7 +111,8 @@ abstract class Optimizer(sessionCatalog: SessionCatalog, conf: SQLConf)
       RemoveRedundantProject,
       SimplifyCreateStructOps,
       SimplifyCreateArrayOps,
-      SimplifyCreateMapOps) ++
+      SimplifyCreateMapOps,
+      CombineConcat) ++
--- End diff --

Hi, @maropu. `CombineConcats` like the other `Combine~` optimizer?
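For context, the rule being registered above flattens nested `Concat` expressions. Its effect can be modeled on a toy expression tree (this is an illustrative sketch over a made-up `Expr` ADT, not Spark's `Expression` API):

```scala
// Toy model of a "combine concats" optimizer rule: recursively
// flatten nested Concat nodes into a single Concat over the leaves.
sealed trait Expr
case class Lit(s: String) extends Expr
case class Concat(children: Seq[Expr]) extends Expr

def combineConcats(e: Expr): Expr = e match {
  case Concat(cs) =>
    // Optimize children first, then splice any child Concat's
    // children directly into this node's child list.
    Concat(cs.map(combineConcats).flatMap {
      case Concat(inner) => inner
      case other         => Seq(other)
    })
  case other => other
}
```

For example, `Concat(a, Concat(b, c))` becomes `Concat(a, b, c)`, which lets codegen emit a single string-building loop instead of nested ones.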
[GitHub] spark issue #17964: [SPARK-20725][SQL] partial aggregate should behave corre...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17964

**[Test build #76901 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76901/testReport)** for PR 17964 at commit [`49da955`](https://github.com/apache/spark/commit/49da955dce260260325708d07becbc692cd3a005).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #17644: [SPARK-17729] [SQL] Enable creating hive bucketed...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/17644#discussion_r116359799

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala ---
@@ -408,9 +425,7 @@ private[hive] class HiveClientImpl(
       },
       schema = schema,
       partitionColumnNames = partCols.map(_.name),
-      // We can not populate bucketing information for Hive tables as Spark SQL has a different
-      // implementation of hash function from Hive.
-      bucketSpec = None,
+      bucketSpec = bucketSpec,
--- End diff --

please add a comment to say that, for data source tables, we will always overwrite the bucket spec in `HiveExternalCatalog` with the bucketing information in table properties.
[GitHub] spark pull request #17644: [SPARK-17729] [SQL] Enable creating hive bucketed...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/17644#discussion_r116359692

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalog.scala ---
@@ -17,6 +17,7 @@ package org.apache.spark.sql.catalyst.catalog

+import org.apache.spark.sql.AnalysisException
--- End diff --

please remove these unnecessary changes in this PR.
[GitHub] spark pull request #17973: [SPARK-20731][SQL] Add ability to change or omit ...
Github user mikkokupsu commented on a diff in the pull request: https://github.com/apache/spark/pull/17973#discussion_r116359395

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
@@ -622,6 +622,31 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
     }
   }

+  test("save tsv with tsv suffix") {
+    withTempDir { dir =>
+      val csvDir = new File(dir, "csv").getCanonicalPath
+      val cars = spark.read
+        .format("csv")
+        .option("header", "true")
+        .load(testFile(carsFile))
+
+      cars.coalesce(1).write
+        .option("header", "true")
+        .option("fileExtension", ".tsv")
+        .option("delimiter", "\t")
--- End diff --

Hi @dongjoon-hyun, yes, the original goal was to remove the file extension, but I decided to let the user decide.
[GitHub] spark pull request #17965: [SPARK-20726][SPARKR] wrapper for SQL broadcast
GitHub user zero323 reopened a pull request: https://github.com/apache/spark/pull/17965

[SPARK-20726][SPARKR] wrapper for SQL broadcast

## What changes were proposed in this pull request?

- Adds R wrapper for `o.a.s.sql.functions.broadcast`.
- Renames `broadcast` to `broadcast_`.

## How was this patch tested?

Unit tests, `check-cran.sh` checks.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zero323/spark SPARK-20726

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17965.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #17965

commit f190d62460829dcfb84ff1a8e6dd6fe9cbd25719
Author: zero323
Date: 2017-05-12T15:54:46Z

    Initial implementation

commit 397ab1f7b4b4e2b9e51b697c92e3be197fed4554
Author: zero323
Date: 2017-05-12T17:38:31Z

    Fix style

commit 246b91f8af84115af8f6283fb783000c9cc613ec
Author: zero323
Date: 2017-05-13T10:08:08Z

    Style

commit 1530785f7469830446cd95717d524eb42d88e4ab
Author: zero323
Date: 2017-05-13T10:38:50Z

    Rename broadcast_ to broadcastRDD
[GitHub] spark pull request #17965: [SPARK-20726][SPARKR] wrapper for SQL broadcast
Github user zero323 closed the pull request at: https://github.com/apache/spark/pull/17965
[GitHub] spark issue #12646: [SPARK-14878][SQL] Trim characters string function suppo...
Github user kevinyu98 commented on the issue: https://github.com/apache/spark/pull/12646 test please.
[GitHub] spark pull request #17973: [SPARK-20731][SQL] Add ability to change or omit ...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/17973#discussion_r116359183

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
@@ -622,6 +622,31 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
     }
   }
 
+  test("save tsv with tsv suffix") {
+    withTempDir { dir =>
+      val csvDir = new File(dir, "csv").getCanonicalPath
+      val cars = spark.read
+        .format("csv")
+        .option("header", "true")
+        .load(testFile(carsFile))
+
+      cars.coalesce(1).write
+        .option("header", "true")
+        .option("fileExtension", ".tsv")
+        .option("delimiter", "\t")
--- End diff --

Hi, @mikkokupsu. Is the original goal to support the many existing files (without the `.csv` extension)?
[GitHub] spark issue #17251: [SPARK-19910][SQL] `stack` should not reject NULL values...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/17251 Could this fix be part of Spark 2.2.0, @cloud-fan and @gatorsmile?
[GitHub] spark issue #17941: [SPARK-20684][R] Expose createGlobalTempView and dropGlo...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/17941 Thank you for the comments, @falaki and @felixcheung. I added the duplication link to the issue, SPARK-20684, and asked @falaki to close the JIRA issue because he is the reporter.
[GitHub] spark issue #17963: [SPARK-20722][CORE] Replay newer event log that hasn't b...
Github user sharkdtu commented on the issue: https://github.com/apache/spark/pull/17963 cc @srowen @ajbozarth
[GitHub] spark pull request #16912: [SPARK-19576] [Core] Task attempt paths exist in ...
Github user sharkdtu closed the pull request at: https://github.com/apache/spark/pull/16912
[GitHub] spark issue #17956: [SPARK-18772][SQL] Avoid unnecessary conversion try for ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/17956 Thank you everybody sincerely.
[GitHub] spark pull request #17973: [SPARK-20731][SQL] Add ability to change or omit ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17973#discussion_r116358020

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
@@ -622,6 +622,31 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
     }
   }
 
+  test("save tsv with tsv suffix") {
+    withTempDir { dir =>
+      val csvDir = new File(dir, "csv").getCanonicalPath
+      val cars = spark.read
+        .format("csv")
+        .option("header", "true")
+        .load(testFile(carsFile))
+
+      cars.coalesce(1).write
+        .option("header", "true")
+        .option("fileExtension", ".tsv")
+        .option("delimiter", "\t")
--- End diff --

I would suggest leaving this out unless there is a stronger reason for it. The downside is that this allows an arbitrary name and does not guarantee the extension is, say, `.tsv` when the delimiter is a tab; it is purely up to the user. I added those extensions long ago, and one of the motivations was auto-detection of the datasource, as Hadoop does (which we ended up not adding yet due to the cost of listing files, etc.).
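[Editor's note: the concern above can be made concrete with a small sketch. Nothing ties the proposed `fileExtension` option to the `delimiter` option, so any consistency between the two is purely a user convention. The helper below is hypothetical (it is not part of Spark); it only illustrates the mapping a caller would have to enforce themselves.]

```scala
// Hypothetical helper, NOT part of Spark: derives the conventional file
// extension for a given delimiter. Spark's CSV writer always emits ".csv"
// regardless of the delimiter, and the proposed "fileExtension" option
// would accept any arbitrary string, consistent or not.
def conventionalExtension(delimiter: String): String = delimiter match {
  case "\t" => ".tsv"
  case ","  => ".csv"
  case _    => ".csv" // no standard extension for exotic delimiters
}
```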
[GitHub] spark pull request #17956: [SPARK-18772][SQL] Avoid unnecessary conversion t...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/17956
[GitHub] spark pull request #16199: [SPARK-18772][SQL] NaN/Infinite float parsing in ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/16199
[GitHub] spark issue #17956: [SPARK-18772][SQL] Avoid unnecessary conversion try for ...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/17956

Thanks, merging to master/2.2! I think this change is pretty safe. We can discuss two things later:
1. whether we want to support more special strings like `Inf`
2. whether we want to make the matching case-insensitive
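[Editor's note: the two follow-up questions can be sketched as below. This is an illustration only, not Spark's actual parsing code (which lives in the CSV read path); the function name and shape are made up for this example.]

```scala
// Illustrative only, NOT Spark's implementation. Spark currently accepts
// the exact spellings "NaN", "Infinity" and "-Infinity"; the two open
// questions are (1) extra spellings like "Inf" and (2) case-insensitivity.
def parseSpecialFloat(s: String, caseInsensitive: Boolean): Option[Float] = {
  val key = if (caseInsensitive) s.toLowerCase else s
  // Exact spellings supported today.
  val exact = Map(
    "NaN" -> Float.NaN,
    "Infinity" -> Float.PositiveInfinity,
    "-Infinity" -> Float.NegativeInfinity)
  // Relaxed table: lower-cased keys plus the extra "Inf" spellings.
  val relaxed = exact.map { case (k, v) => k.toLowerCase -> v } ++
    Map("inf" -> Float.PositiveInfinity, "-inf" -> Float.NegativeInfinity)
  if (caseInsensitive) relaxed.get(key) else exact.get(key)
}
```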
[GitHub] spark issue #17298: [SPARK-19094][WIP][PySpark] Plumb through logging for IJ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17298 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76897/ Test FAILed.
[GitHub] spark issue #17298: [SPARK-19094][WIP][PySpark] Plumb through logging for IJ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17298 Merged build finished. Test FAILed.
[GitHub] spark issue #17298: [SPARK-19094][WIP][PySpark] Plumb through logging for IJ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17298 **[Test build #76897 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76897/testReport)** for PR 17298 at commit [`08632fd`](https://github.com/apache/spark/commit/08632fdcee127bf43cf90f44139925e2c26b4946).
* This patch **fails PySpark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/17938 We are going to support bucketing in Hive-style CREATE TABLE syntax soon.
[GitHub] spark issue #17964: [SPARK-20725][SQL] partial aggregate should behave corre...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17964 **[Test build #76901 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76901/testReport)** for PR 17964 at commit [`49da955`](https://github.com/apache/spark/commit/49da955dce260260325708d07becbc692cd3a005).
[GitHub] spark pull request #17964: [SPARK-20725][SQL] partial aggregate should behav...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/17964#discussion_r116357380

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/SameResultSuite.scala ---
@@ -46,4 +48,10 @@ class SameResultSuite extends QueryTest with SharedSQLContext {
     df.queryExecution.sparkPlan.find(_.isInstanceOf[FileSourceScanExec]).get
       .asInstanceOf[FileSourceScanExec]
   }
+
+  test("SPARK-20725: partial aggregate should behave correctly for sameResult") {
+    val df1 = spark.range(10).agg(sum($"id"))
+    val df2 = spark.range(10).agg(sum($"id"))
--- End diff --

Good catch! The reason is that `HashAggregateExec.requiredChildDistributionExpressions` is an `Option[Seq[Expression]]`, which is not treated as part of the expressions of `HashAggregateExec` and is thus not touched by `QueryPlan.mapExpressions`. I have fixed it in `QueryPlan`.
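[Editor's note: the traversal gap described above can be reproduced with a toy model. The names below (`Agg`, `Attr`) are made up for illustration, not Spark code. The point is that a transform which only rewrites the obvious expression-typed fields, as the pre-fix `QueryPlan.mapExpressions` did, never reaches expressions nested inside an `Option[Seq[...]]` field, so canonicalization misses them and `sameResult` reports false for semantically identical plans.]

```scala
// Toy model: Attr stands in for an attribute with a per-plan expression id;
// canonicalization rewrites all ids to 0 so equivalent plans compare equal.
sealed trait Expr
case class Attr(id: Long) extends Expr

// Agg mimics HashAggregateExec: one ordinary Seq[Expr] field and one
// Option[Seq[Expr]] field like requiredChildDistributionExpressions.
case class Agg(groupings: Seq[Expr], requiredDistribution: Option[Seq[Expr]])

// Buggy traversal: only touches the Seq[Expr] field.
def mapExpressionsNaive(plan: Agg, f: Expr => Expr): Agg =
  plan.copy(groupings = plan.groupings.map(f))

// Fixed traversal: also descends into the Option[Seq[Expr]] field.
def mapExpressionsFixed(plan: Agg, f: Expr => Expr): Agg =
  plan.copy(
    groupings = plan.groupings.map(f),
    requiredDistribution = plan.requiredDistribution.map(_.map(f)))

// Canonicalization: normalize every expression id to 0.
val canonicalize: Expr => Expr = { case Attr(_) => Attr(0) }
```

With the naive traversal, two plans that differ only in expression ids still compare unequal after canonicalization, which is exactly the `sameResult` failure the test above exercises.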
[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17938 Merged build finished. Test PASSed.
[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17938 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76900/ Test PASSed.
[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17938 **[Test build #76900 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76900/testReport)** for PR 17938 at commit [`92fb3b3`](https://github.com/apache/spark/commit/92fb3b3e00a666ff3bd1eca4e5dee0cefcca2d55).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...
Github user zero323 commented on the issue: https://github.com/apache/spark/pull/17938 @cloud-fan Thanks for the clarification. Just a thought: shouldn't we either support it consistently or not support it at all? The current behaviour is quite confusing, and I don't think documentation alone will cut it.
[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17938 **[Test build #76900 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76900/testReport)** for PR 17938 at commit [`92fb3b3`](https://github.com/apache/spark/commit/92fb3b3e00a666ff3bd1eca4e5dee0cefcca2d55).
[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17938 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76899/ Test PASSed.
[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17938 Merged build finished. Test PASSed.
[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17938 **[Test build #76899 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76899/testReport)** for PR 17938 at commit [`b5babf6`](https://github.com/apache/spark/commit/b5babf65571661ca45880cd80a950959f66523a1).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #17965: [SPARK-20726][SPARKR] wrapper for SQL broadcast
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17965 Merged build finished. Test PASSed.
[GitHub] spark issue #17965: [SPARK-20726][SPARKR] wrapper for SQL broadcast
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17965 **[Test build #76898 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76898/testReport)** for PR 17965 at commit [`1530785`](https://github.com/apache/spark/commit/1530785f7469830446cd95717d524eb42d88e4ab).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #17965: [SPARK-20726][SPARKR] wrapper for SQL broadcast
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17965 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76898/ Test PASSed.
[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/17938 When you omit `USING`, it is Hive-style CREATE TABLE syntax, which is very different from Spark's. We should encourage users to use the Spark-style CREATE TABLE syntax and only document that form (with the `USING` clause).
[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17938 **[Test build #76899 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76899/testReport)** for PR 17938 at commit [`b5babf6`](https://github.com/apache/spark/commit/b5babf65571661ca45880cd80a950959f66523a1).
[GitHub] spark pull request #17965: [SPARK-20726][SPARKR] wrapper for SQL broadcast
Github user zero323 commented on a diff in the pull request: https://github.com/apache/spark/pull/17965#discussion_r116355839

--- Diff: R/pkg/R/DataFrame.R ---
@@ -3769,3 +3769,33 @@ setMethod("alias",
             sdf <- callJMethod(object@sdf, "alias", data)
             dataFrame(sdf)
           })
+
--- End diff --

Done.
[GitHub] spark issue #17965: [SPARK-20726][SPARKR] wrapper for SQL broadcast
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17965 **[Test build #76898 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76898/testReport)** for PR 17965 at commit [`1530785`](https://github.com/apache/spark/commit/1530785f7469830446cd95717d524eb42d88e4ab).
[GitHub] spark pull request #17965: [SPARK-20726][SPARKR] wrapper for SQL broadcast
Github user zero323 commented on a diff in the pull request: https://github.com/apache/spark/pull/17965#discussion_r116355836

--- Diff: R/pkg/R/generics.R ---
@@ -799,6 +799,10 @@ setGeneric("write.df", function(df, path = NULL, ...) { standardGeneric("write.d
 #' @export
 setGeneric("randomSplit", function(x, weights, seed) { standardGeneric("randomSplit") })
 
+#' @rdname broadcast
+#' @export
+setGeneric("broadcast", function(x) { standardGeneric("broadcast") })
--- End diff --

It doesn't seem to affect the docs, so I don't think we have to touch this for now:

![image](https://cloud.githubusercontent.com/assets/1554276/26024791/88a39940-37d9-11e7-9f11-ac1510b59215.png)
[GitHub] spark issue #17298: [SPARK-19094][WIP][PySpark] Plumb through logging for IJ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17298 **[Test build #76897 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76897/testReport)** for PR 17298 at commit [`08632fd`](https://github.com/apache/spark/commit/08632fdcee127bf43cf90f44139925e2c26b4946).
[GitHub] spark pull request #17969: [SPARK-20729][SPARKR][ML] Reduce boilerplate in S...
Github user zero323 commented on a diff in the pull request: https://github.com/apache/spark/pull/17969#discussion_r116355659 --- Diff: R/pkg/R/mllib_wrapper.R --- @@ -0,0 +1,61 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +#' S4 class that represents a Java ML model +#' +#' @param jobj a Java object reference to the backing Scala model +#' @export +#' @note JavaModel since 2.3.0 +setClass("JavaModel", representation(jobj = "jobj")) + +#' Makes predictions from a Java ML model +#' +#' @param object a Spark ML model. +#' @param newData a SparkDataFrame for testing. +#' @return \code{predict} returns a SparkDataFrame containing predicted value. +#' @rdname spark.predict +#' @aliases predict,JavaModel-method --- End diff -- I am biased here, but I'll argue that it doesn't. Both `predict` and `write.ml` (same as `read.ml`) are extremely generic and in general we don't provide any useful information about these. And the usage is already covered by class `examples`. 
Finally, we can use `@seealso` to provide a bit more R-ish experience if you think it is not enough. Something along the lines of the `lm` docs: ![image](https://cloud.githubusercontent.com/assets/1554276/26024731/2214f012-37d8-11e7-9afb-b750e9c647ff.png) Moreover, using this approach significantly reduces the amount of clutter in the generated docs. They are shorter, and the argument list is focused on the important parts, same as `value`. See for example the GLM docs below. So IMHO this is actually a significant improvement. Personally I would do the same with all the `print`s and `summaries` as well, although it wouldn't reduce the codebase (for now). This would further shorten the docs and remove awkward descriptions like this: ![image](https://cloud.githubusercontent.com/assets/1554276/26024707/567b2020-37d7-11e7-8c21-260404d7767d.png) And from the developer side it is a clear win. No mindless copy / paste / replace cycle, and more time to provide useful examples. __Before__: ![image](https://cloud.githubusercontent.com/assets/1554276/26024648/1c36253c-37d6-11e7-9411-72c0c14c54a8.png) __After__: ![image](https://cloud.githubusercontent.com/assets/1554276/26024653/2643bd64-37d6-11e7-8463-08662611cd37.png)
[GitHub] spark issue #12646: [SPARK-14878][SQL] Trim characters string function suppo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/12646 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76895/ Test PASSed.
[GitHub] spark issue #12646: [SPARK-14878][SQL] Trim characters string function suppo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/12646 Merged build finished. Test PASSed.
[GitHub] spark issue #12646: [SPARK-14878][SQL] Trim characters string function suppo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/12646 **[Test build #76895 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76895/testReport)** for PR 12646 at commit [`63fab9f`](https://github.com/apache/spark/commit/63fab9f87af0a551efe9f9d3872ff17b972ee834). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #17965: [SPARK-20726][SPARKR] wrapper for SQL broadcast
Github user zero323 commented on a diff in the pull request: https://github.com/apache/spark/pull/17965#discussion_r116355102 --- Diff: R/pkg/R/DataFrame.R --- @@ -3769,3 +3769,33 @@ setMethod("alias", sdf <- callJMethod(object@sdf, "alias", data) dataFrame(sdf) }) + + +#' broadcast +#' +#' Return a new SparkDataFrame marked as small enough for use in broadcast joins. +#' +#' Equivalent to hint(x, "broadcast"). --- End diff -- I did double-check this, but for some reason `\code` here made `roxygen` unhappy when I tried it last time.
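For context, the behavior this SparkR wrapper mirrors is already exposed on the Scala side through `org.apache.spark.sql.functions.broadcast` and `Dataset.hint("broadcast")`. A minimal Scala sketch (the data and names here are hypothetical, for illustration only):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("broadcast-hint-sketch").getOrCreate()
import spark.implicits._

// Hypothetical fact and (small) dimension tables.
val orders = Seq((1, "book"), (2, "pen")).toDF("cust_id", "item")
val customers = Seq((1, "Alice"), (2, "Bob")).toDF("cust_id", "name")

// Marking the small side for broadcast; equivalent to customers.hint("broadcast").
val joined = orders.join(broadcast(customers), "cust_id")
joined.explain()  // the plan should prefer a broadcast join over a sort-merge join
```

This requires a running Spark session, so it is a sketch of the API shape rather than a standalone program.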
[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...
Github user zero323 commented on the issue: https://github.com/apache/spark/pull/17938 @gatorsmile Huh... in that case it looks like the parser (?) needs a little bit of work, unless of course the following are intended features.

- Omitting `USING` doesn't work:

  ```sql
  CREATE TABLE user_info_bucketed(user_id BIGINT, firstname STRING, lastname STRING)
  CLUSTERED BY(user_id) INTO 256 BUCKETS
  ```

  fails with:

  ```
  Error in query:
  Operation not allowed: CREATE TABLE ... CLUSTERED BY(line 1, pos 0)

  == SQL ==
  CREATE TABLE user_info_bucketed(user_id BIGINT, firstname STRING, lastname STRING)
  ^^^
  CLUSTERED BY(user_id) INTO 256 BUCKETS
  ```

- Omitting `USING` and adding `PARTITIONED BY` with a column not present in the main column list (valid Hive DDL) doesn't work:

  ```sql
  CREATE TABLE user_info_bucketed(user_id BIGINT, firstname STRING, lastname STRING)
  PARTITIONED BY (department STRING)
  CLUSTERED BY(user_id) INTO 256 BUCKETS
  ```

  fails with:

  ```
  Error in query:
  Operation not allowed: CREATE TABLE ... CLUSTERED BY(line 1, pos 2)

  == SQL ==
  CREATE TABLE user_info_bucketed(user_id BIGINT, firstname STRING, lastname STRING)
  --^^^
  PARTITIONED BY (department STRING)
  CLUSTERED BY(user_id) INTO 256 BUCKETS
  ```

- `PARTITIONED BY` alone works:

  ```sql
  CREATE TABLE user_info_bucketed(user_id BIGINT, firstname STRING, lastname STRING)
  PARTITIONED BY (department STRING)
  ```

- `PARTITIONED BY` with `USING`, when the partition column is in the main spec, works:

  ```sql
  CREATE TABLE user_info_bucketed(
    user_id BIGINT, firstname STRING, lastname STRING, department STRING)
  USING parquet
  PARTITIONED BY (department)
  ```

- `CLUSTERED BY` + `PARTITIONED BY` with `USING`, when the partition column is in the main spec, works:

  ```sql
  CREATE TABLE user_info_bucketed(
    user_id BIGINT, firstname STRING, lastname STRING, department STRING)
  USING parquet
  PARTITIONED BY (department)
  CLUSTERED BY(user_id) INTO 256 BUCKETS
  ```

- `PARTITIONED BY` when the partition column is in the main spec but `USING` is omitted:

  ```sql
  CREATE TABLE user_info_bucketed(
    user_id BIGINT, firstname STRING,
    lastname STRING, department STRING)
  PARTITIONED BY (department)
  ```

  fails with:

  ```
  Error in query:
  mismatched input ')' expecting {'SELECT', 'FROM', 'ADD', 'AS', 'ALL', 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 'ROLLUP', 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 'EXISTS', 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 'ASC', 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 'JOIN', 'CROSS', 'OUTER', 'INNER', 'LEFT', 'SEMI', 'RIGHT', 'FULL', 'NATURAL', 'ON', 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 'UNBOUNDED', 'PRECEDING', 'FOLLOWING', 'CURRENT', 'FIRST', 'AFTER', 'LAST', 'ROW', 'WITH', 'VALUES', 'CREATE', 'TABLE', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', 'EXPLAIN', 'FORMAT', 'LOGICAL', 'CODEGEN', 'COST', 'CAST', 'SHOW', 'TABLES', 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', 'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', 'RESET', 'DATA', 'START', 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'MACRO', 'IGNORE', 'IF', 'DIV', 'PERCENT', 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', 'TRANSFORM', 'REDUCE', 'USING', 'SERDE', 'SERDEPROPERTIES', 'RECORDREADER', 'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', 'COLLECTION', 'ITEMS', 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'FUNCTION', 'EXTENDED', 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 'FORMATTED', 'GLOBAL', TEMPORARY, 'OPTIONS', 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 'STORED', 'DIRECTORIES', 'LOCATION', 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', 'TOUCH', 'COMPACT', 'CONCATENATE', 'CHANGE', 'CASCADE', 'RESTRICT', 'CLUSTERED', 'SORTED', 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, DATABASES, 'DFS', 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', 'PARTITIONED', 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 'MSCK', 'REPAIR', 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', 'COMPACTIONS', 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', 'OPTION', 'ANTI', 'LOCAL', 'INPATH', 'CURRENT_DATE', 'CURRENT_TIMESTAMP', IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 3, pos 30)

  == SQL ==
  CREATE TABLE user_info_bucketed(
    user_id BIGINT, firstname STRING,
    lastname STRING, department STRING)
  PARTITIONED BY
  ```
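For comparison, the bucketed-and-partitioned layout produced by the working `USING` variants above can also be created through the `DataFrameWriter` API. A minimal Scala sketch (the DataFrame contents and table name are hypothetical; `bucketBy`/`saveAsTable` require a metastore-backed table, so this needs a running Spark session with Hive support):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("bucketing-sketch")
  .enableHiveSupport()
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data matching the column list from the DDL above.
val userInfo = Seq((1L, "Jane", "Doe", "eng"))
  .toDF("user_id", "firstname", "lastname", "department")

// Mirrors: USING parquet PARTITIONED BY (department)
//          CLUSTERED BY(user_id) INTO 256 BUCKETS
userInfo.write
  .format("parquet")
  .partitionBy("department")
  .bucketBy(256, "user_id")
  .saveAsTable("user_info_bucketed")
```

Note that `bucketBy` only works with `saveAsTable`, not with a plain path-based `save()`, which is consistent with bucketing metadata living in the catalog.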
[GitHub] spark issue #17973: [SPARK-20731][SQL] Add ability to change or omit .csv fi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17973 Merged build finished. Test PASSed.
[GitHub] spark issue #17973: [SPARK-20731][SQL] Add ability to change or omit .csv fi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17973 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76896/ Test PASSed.
[GitHub] spark issue #17964: [SPARK-20725][SQL] partial aggregate should behave corre...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17964 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76893/ Test PASSed.
[GitHub] spark issue #17964: [SPARK-20725][SQL] partial aggregate should behave corre...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17964 Merged build finished. Test PASSed.
[GitHub] spark issue #17964: [SPARK-20725][SQL] partial aggregate should behave corre...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17964 **[Test build #76893 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76893/testReport)** for PR 17964 at commit [`557298e`](https://github.com/apache/spark/commit/557298e3d88c04910ebff9cdb1ae77a1537c83af). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.