[GitHub] spark pull request #22976: [SPARK-25974][SQL]Optimizes Generates bytecode fo...

2018-11-11 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22976#discussion_r232552336
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateOrdering.scala
 ---
@@ -68,62 +68,55 @@ object GenerateOrdering extends 
CodeGenerator[Seq[SortOrder], Ordering[InternalR
 genComparisons(ctx, ordering)
   }
 
+  /**
+   * Creates the variables for ordering based on the given order.
+   */
+  private def createOrderKeys(
+ctx: CodegenContext,
--- End diff --

4-space indentation
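For reference, a minimal sketch of the style being asked for (hypothetical method name and types; only the indentation is the point):

```scala
// Continuation lines of a multi-line parameter list get a 4-space indent;
// the body then goes back to the usual 2-space indent.
def formatOrderKeys(
    prefix: String,
    ordering: Seq[String]): Seq[String] = {
  ordering.map(key => s"$prefix$key")
}
```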


---




[GitHub] spark issue #22955: [SPARK-25949][SQL] Add test for PullOutPythonUDFInJoinCo...

2018-11-11 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22955
  
thanks, merging to master!


---




[GitHub] spark pull request #22938: [SPARK-25935][SQL] Prevent null rows from JSON pa...

2018-11-11 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22938#discussion_r232550860
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala
 ---
@@ -1813,6 +1817,7 @@ class JsonSuite extends QueryTest with 
SharedSQLContext with TestJsonData {
   val path = dir.getCanonicalPath
   primitiveFieldAndType
 .toDF("value")
+.repartition(1)
--- End diff --

why is the `repartition` required?
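For context, a hedged guess at what `.repartition(1)` changes (a standalone sketch assuming a spark-shell session, not the test's actual code): a single partition means the write produces exactly one output file, which keeps any file-level assertions deterministic.

```scala
import spark.implicits._

val df = Seq("""{"a": 1}""", """{"a": 2}""").toDF("value")

// One partition -> exactly one part-* file in the output directory.
df.repartition(1).write.text("/tmp/json-single-file")

// Several partitions -> several part-* files, so any check that reads back
// "the" written file would depend on how the data was partitioned.
df.repartition(2).write.text("/tmp/json-two-files")
```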


---




[GitHub] spark pull request #22938: [SPARK-25935][SQL] Prevent null rows from JSON pa...

2018-11-11 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22938#discussion_r232550733
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala
 ---
@@ -1115,6 +1115,7 @@ class JsonSuite extends QueryTest with 
SharedSQLContext with TestJsonData {
 Row(null, null, null),
 Row(null, null, null),
 Row(null, null, null),
+Row(null, null, null),
--- End diff --

So for the JSON data source, the previous behavior was that we would skip the row even in PERMISSIVE mode. Shall we clearly mention this in the migration guide?


---




[GitHub] spark pull request #22938: [SPARK-25935][SQL] Prevent null rows from JSON pa...

2018-11-11 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22938#discussion_r232550502
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
 ---
@@ -550,15 +550,23 @@ case class JsonToStructs(
   s"Input schema ${nullableSchema.catalogString} must be a struct, an 
array or a map.")
   }
 
-  // This converts parsed rows to the desired output by the given schema.
   @transient
-  lazy val converter = nullableSchema match {
-case _: StructType =>
-  (rows: Iterator[InternalRow]) => if (rows.hasNext) rows.next() else 
null
-case _: ArrayType =>
-  (rows: Iterator[InternalRow]) => if (rows.hasNext) 
rows.next().getArray(0) else null
-case _: MapType =>
-  (rows: Iterator[InternalRow]) => if (rows.hasNext) 
rows.next().getMap(0) else null
+  private lazy val castRow = nullableSchema match {
+case _: StructType => (row: InternalRow) => row
+case _: ArrayType => (row: InternalRow) => row.getArray(0)
+case _: MapType => (row: InternalRow) => row.getMap(0)
+  }
+
+  // This converts parsed rows to the desired output by the given schema.
+  private def convertRow(rows: Iterator[InternalRow]) = {
+if (rows.hasNext) {
+  val result = rows.next()
+  // JSON's parser produces one record only.
+  assert(!rows.hasNext)
+  castRow(result)
+} else {
+  throw new IllegalArgumentException("Expected one row from JSON 
parser.")
--- End diff --

This can only happen when we have a bug, right?


---




[GitHub] spark pull request #22938: [SPARK-25935][SQL] Prevent null rows from JSON pa...

2018-11-11 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22938#discussion_r232550186
  
--- Diff: docs/sql-migration-guide-upgrade.md ---
@@ -15,6 +15,8 @@ displayTitle: Spark SQL Upgrading Guide
 
   - Since Spark 3.0, the `from_json` functions supports two modes - 
`PERMISSIVE` and `FAILFAST`. The modes can be set via the `mode` option. The 
default mode became `PERMISSIVE`. In previous versions, behavior of `from_json` 
did not conform to either `PERMISSIVE` nor `FAILFAST`, especially in processing 
of malformed JSON records. For example, the JSON string `{"a" 1}` with the 
schema `a INT` is converted to `null` by previous versions but Spark 3.0 
converts it to `Row(null)`.
 
+  - In Spark version 2.4 and earlier, JSON data source and the `from_json` 
function produced `null`s if there is no valid root JSON token in its input (` 
` for example). Since Spark 3.0, such input is treated as a bad record and 
handled according to specified mode. For example, in the `PERMISSIVE` mode the 
` ` input is converted to `Row(null, null)` if specified schema is `key STRING, 
value INT`. 
--- End diff --

> In Spark version 2.4 and earlier, JSON data source and the `from_json` 
function produced `null`s

Shall we update this? According to what you said, the JSON data source can't produce nulls.
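For illustration, a minimal sketch of the `from_json` half of the migration-guide example above (spark-shell assumed; the per-version results are taken from the guide text, not re-verified here):

```scala
import org.apache.spark.sql.functions.{from_json, lit}
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

// Schema `key STRING, value INT`; input " " has no valid root JSON token.
val schema = new StructType()
  .add("key", StringType)
  .add("value", IntegerType)

spark.range(1)
  .select(from_json(lit(" "), schema).as("parsed"))
  .show()
// Spark 2.4 and earlier: `parsed` is null.
// Spark 3.0, PERMISSIVE mode: `parsed` is Row(null, null).
```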


---




[GitHub] spark issue #22998: [SPARK-26001][SQL]Reduce memory copy when writing decima...

2018-11-11 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22998
  
I think this is wrong. We have to zero out the bytes even when writing a null decimal, so that two unsafe rows with the same values (including null values) are exactly the same in binary format.
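A toy sketch of the invariant being described, in plain Scala rather than Spark's actual unsafe-row writer: if the slot of a null field keeps stale bytes instead of being zeroed, two logically equal rows are no longer byte-identical.

```scala
import java.util.Arrays

// Fixed-width 8-byte slot for a decimal field that is null in both rows.
val staleSlot  = Array[Byte](1, 2, 3, 4, 5, 6, 7, 8) // leftover bytes, not zeroed
val zeroedSlot = Array[Byte](0, 0, 0, 0, 0, 0, 0, 0)

// Logically both mean "null", but a binary comparison disagrees.
println(Arrays.equals(staleSlot, zeroedSlot)) // false
```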


---




[GitHub] spark issue #22990: [SPARK-25988] [SQL] Keep names unchanged when deduplicat...

2018-11-09 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22990
  
retest this please


---




[GitHub] spark issue #22990: [SPARK-25988] [SQL] Keep names unchanged when deduplicat...

2018-11-08 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22990
  
good catch! LGTM


---




[GitHub] spark pull request #22990: [SPARK-25988] [SQL] Keep names unchanged when ded...

2018-11-08 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22990#discussion_r232148751
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala 
---
@@ -2856,6 +2856,59 @@ class SQLQuerySuite extends QueryTest with 
SharedSQLContext {
   checkAnswer(sql("select 26393499451 / (1e6 * 1000)"), 
Row(BigDecimal("26.393499451")))
 }
   }
+
+  test("self join with aliases on partitioned tables #1") {
--- End diff --

let's put the JIRA ticket number in the test name


---




[GitHub] spark pull request #22990: [SPARK-25988] [SQL] Keep names unchanged when ded...

2018-11-08 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22990#discussion_r232148583
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala 
---
@@ -2856,6 +2856,59 @@ class SQLQuerySuite extends QueryTest with 
SharedSQLContext {
   checkAnswer(sql("select 26393499451 / (1e6 * 1000)"), 
Row(BigDecimal("26.393499451")))
 }
   }
+
+  test("self join with aliases on partitioned tables #1") {
+withTempView("tmpView1", "tmpView2") {
+  withTable("tab1", "tab2") {
+sql(
+  """
+|CREATE TABLE `tab1` (`col1` INT, `TDATE` DATE)
+|USING CSV
+|PARTITIONED BY (TDATE)
+  """.stripMargin)
+spark.table("tab1").where("TDATE >= 
'2017-08-15'").createOrReplaceTempView("tmpView1")
+sql("CREATE TABLE `tab2` (`TDATE` DATE) USING parquet")
+sql(
+  """
+|CREATE OR REPLACE TEMPORARY VIEW tmpView2 AS
+|SELECT N.tdate, col1 AS aliasCol1
+|FROM tmpView1 N
+|JOIN tab2 Z
+|ON N.tdate = Z.tdate
+  """.stripMargin)
+withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "0") {
+  sql("SELECT * FROM tmpView2 x JOIN tmpView2 y ON x.tdate = 
y.tdate").collect()
+}
+  }
+}
+  }
+
+  test("self join with aliases on partitioned tables #2") {
+withTempView("tmp") {
+  withTable("tab1", "tab2") {
+sql(
+  """
+|CREATE TABLE `tab1` (`EX` STRING, `TDATE` DATE)
+|USING parquet
+|PARTITIONED BY (tdate)
+  """.stripMargin)
+sql("CREATE TABLE `tab2` (`TDATE` DATE) USING parquet")
+sql(
+  """
+|CREATE OR REPLACE TEMPORARY VIEW TMP as
+|SELECT  N.tdate, EX AS new_ex
+|FROM tab1 N
+|JOIN tab2 Z
+|ON  N.tdate = Z.tdate
--- End diff --

nit: `ON N.tdate = Z.tdate`


---




[GitHub] spark pull request #22987: [SPARK-25979][SQL] Window function: allow parenth...

2018-11-08 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22987#discussion_r232135028
  
--- Diff: sql/core/src/test/resources/sql-tests/inputs/window.sql ---
@@ -109,3 +109,9 @@ last_value(false, false) OVER w AS 
last_value_contain_null
 FROM testData
 WINDOW w AS ()
 ORDER BY cate, val;
+
+-- parentheses around window reference
+SELECT cate, sum(val) OVER (w)
+FROM testData
+WHERE val is not null
+WINDOW w AS (PARTITION BY cate ORDER BY val);
--- End diff --

need a new line at the end.


---




[GitHub] spark pull request #22987: [SPARK-25979][SQL] Window function: allow parenth...

2018-11-08 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22987#discussion_r231992909
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/SQLWindowFunctionSuite.scala
 ---
@@ -31,6 +32,19 @@ class SQLWindowFunctionSuite extends QueryTest with 
SharedSQLContext {
 
   import testImplicits._
 
+  val empSalaryData = Seq(
--- End diff --

We really just need a simple test that proves `over (w)` works.


---




[GitHub] spark pull request #22987: [SPARK-25979][SQL] Window function: allow parenth...

2018-11-08 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22987#discussion_r231992777
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/SQLWindowFunctionSuite.scala
 ---
@@ -31,6 +32,19 @@ class SQLWindowFunctionSuite extends QueryTest with 
SharedSQLContext {
 
   import testImplicits._
 
+  val empSalaryData = Seq(
--- End diff --

can we just add a new test in `window.sql`?


---




[GitHub] spark issue #22978: [SPARK-25676][SQL][FOLLOWUP] Use 'foreach(_ => ())'

2018-11-08 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22978
  
thanks, merging to master!


---




[GitHub] spark pull request #22984: [minor] update HiveExternalCatalogVersionsSuite t...

2018-11-08 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22984#discussion_r231922622
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala
 ---
@@ -206,7 +206,7 @@ class HiveExternalCatalogVersionsSuite extends 
SparkSubmitTestUtils {
 
 object PROCESS_TABLES extends QueryTest with SQLTestUtils {
   // Tests the latest version of every release line.
-  val testingVersions = Seq("2.1.3", "2.2.2", "2.3.2")
+  val testingVersions = Seq("2.1.3", "2.2.2", "2.3.2", "2.4.0")
--- End diff --

when will we drop 2.1 officially?


---




[GitHub] spark issue #22984: [minor] update HiveExternalCatalogVersionsSuite to test ...

2018-11-08 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22984
  
cc @gatorsmile @srowen 


---




[GitHub] spark pull request #22984: [minor] update HiveExternalCatalogVersionsSuite t...

2018-11-08 Thread cloud-fan
GitHub user cloud-fan opened a pull request:

https://github.com/apache/spark/pull/22984

[minor] update HiveExternalCatalogVersionsSuite to test 2.4.0

## What changes were proposed in this pull request?

Since Spark 2.4.0 is released, we should test it in 
HiveExternalCatalogVersionsSuite

## How was this patch tested?

N/A

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/cloud-fan/spark minor

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22984.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22984


commit b9817b4e882a8fafc4a8eba498b969e513e64fa3
Author: Wenchen Fan 
Date:   2018-11-08T14:58:54Z

update HiveExternalCatalogVersionsSuite to test 2.4.0




---




[GitHub] spark pull request #22977: [BUILD] Bump previousSparkVersion in MimaBuild.sc...

2018-11-08 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22977#discussion_r231872904
  
--- Diff: project/MimaExcludes.scala ---
@@ -84,7 +84,17 @@ object MimaExcludes {
 
ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.streaming.api.java.JavaPairDStream.flatMapValues"),
 // [SPARK-25680] SQL execution listener shouldn't happen on execution 
thread
 
ProblemFilters.exclude[IncompatibleResultTypeProblem]("org.apache.spark.sql.util.ExecutionListenerManager.clone"),
-
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.sql.util.ExecutionListenerManager.this")
+
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.sql.util.ExecutionListenerManager.this"),
+// [SPARK-25862][SQL] Remove rangeBetween APIs introduced in 
SPARK-21608
+
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.sql.functions.unboundedFollowing"),
+
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.sql.functions.unboundedPreceding"),
+
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.sql.functions.currentRow"),
+
ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.sql.expressions.Window.rangeBetween"),
+
ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.sql.expressions.WindowSpec.rangeBetween"),
+// [SPARK-23781][CORE] Merge token renewer functionality into 
HadoopDelegationTokenManager
+
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.deploy.SparkHadoopUtil.nextCredentialRenewalTime"),
--- End diff --

This is actually a private method; I'm not sure why MiMa tracks it.


---




[GitHub] spark pull request #22977: [BUILD] Bump previousSparkVersion in MimaBuild.sc...

2018-11-08 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22977#discussion_r231866958
  
--- Diff: project/MimaBuild.scala ---
@@ -88,7 +88,7 @@ object MimaBuild {
 
   def mimaSettings(sparkHome: File, projectRef: ProjectRef) = {
 val organization = "org.apache.spark"
-val previousSparkVersion = "2.2.0"
+val previousSparkVersion = "2.4.0"
--- End diff --

To my understanding, we should have changed it to 2.3.0 when 2.3.0 was released.

Maybe we should send another PR to branch-2.4 and change it to 2.3.0.


---




[GitHub] spark issue #22977: [BUILD] Bump previousSparkVersion in MimaBuild.scala to ...

2018-11-08 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22977
  
I think one major issue is that there is no documentation about how to update MiMa with new releases. Does anyone know the detailed process? It seems we need to update `MimaExcludes.scala` with something like `lazy val v30excludes = v24excludes ++ Seq` when cutting the new branch, and update `MimaBuild.scala` when the new release is published.

And there are 2 remaining issues:
1. The data source v2 changes broke a lot of MiMa rules, while I expect interfaces marked as `Evolving` should not be tracked by MiMa.
2. MLlib broke a lot of MiMa rules, seemingly caused by https://github.com/apache/spark/pull/22921 . @srowen can you take a look?

Also cc @JoshRosen @gatorsmile @shaneknapp @vanzin @holdenk @felixcheung 


---




[GitHub] spark pull request #22977: [BUILD] Bump previousSparkVersion in MimaBuild.sc...

2018-11-08 Thread cloud-fan
GitHub user cloud-fan opened a pull request:

https://github.com/apache/spark/pull/22977

[BUILD] Bump previousSparkVersion in MimaBuild.scala to be 2.4.0

## What changes were proposed in this pull request?

Since Spark 2.4.0 is already in the Maven repo, we can bump previousSparkVersion in MimaBuild.scala to 2.4.0.

Note that it seems we forgot to do this for 2.4.0, so this PR also updates MimaExcludes.scala.

## How was this patch tested?

N/A

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/cloud-fan/spark mima

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22977.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22977


commit a0357782e75967d309dc5229b3c36a0c295f2956
Author: Wenchen Fan 
Date:   2018-11-08T11:34:31Z

Bump previousSparkVersion in MimaBuild.scala to be 2.2.0




---




[GitHub] spark pull request #22970: [SPARK-25676][FOLLOWUP][BUILD] Fix Scala 2.12 bui...

2018-11-08 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22970#discussion_r231833555
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/WideTableBenchmark.scala
 ---
@@ -42,7 +43,7 @@ object WideTableBenchmark extends SqlBasedBenchmark {
   Seq("10", "100", "1024", "2048", "4096", "8192", "65536").foreach { 
n =>
 benchmark.addCase(s"split threshold $n", numIters = 5) { iter =>
   withSQLConf(SQLConf.CODEGEN_METHOD_SPLIT_THRESHOLD.key -> n) {
-df.selectExpr(columns: _*).foreach(identity(_))
+df.selectExpr(columns: _*).foreach((x => x): Row => Unit)
--- End diff --

shall we use `foreach(_ => ())`?


---




[GitHub] spark pull request #22938: [SPARK-25935][SQL] Prevent null rows from JSON pa...

2018-11-07 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22938#discussion_r231762733
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
 ---
@@ -550,15 +550,33 @@ case class JsonToStructs(
   s"Input schema ${nullableSchema.catalogString} must be a struct, an 
array or a map.")
   }
 
-  // This converts parsed rows to the desired output by the given schema.
   @transient
-  lazy val converter = nullableSchema match {
-case _: StructType =>
-  (rows: Iterator[InternalRow]) => if (rows.hasNext) rows.next() else 
null
-case _: ArrayType =>
-  (rows: Iterator[InternalRow]) => if (rows.hasNext) 
rows.next().getArray(0) else null
-case _: MapType =>
-  (rows: Iterator[InternalRow]) => if (rows.hasNext) 
rows.next().getMap(0) else null
+  private lazy val castRow = nullableSchema match {
+case _: StructType => (row: InternalRow) => row
+case _: ArrayType => (row: InternalRow) =>
+  if (row.isNullAt(0)) {
+new GenericArrayData(Array())
--- End diff --

I think this is where `from_json` differs from the JSON data source. A data source must produce data as rows, while `from_json` can return an array or a map.

I think the previous behavior also makes sense. For array/map, we don't have the corrupt-record column, and returning null is reasonable. Actually I prefer null over an empty array/map, but we need more discussion about this behavior.
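A small sketch of the array case discussed here (spark-shell assumed; hypothetical input, and the choice between `null` and an empty array is exactly the open question above):

```scala
import org.apache.spark.sql.functions.{from_json, lit}
import org.apache.spark.sql.types.{ArrayType, IntegerType, StructType}

val arraySchema = ArrayType(new StructType().add("a", IntegerType))

// Malformed record: with an array (or map) schema there is no corrupt-record
// column to route the failure to, so the result is either null or an empty array.
spark.range(1)
  .select(from_json(lit("""[{"a" 1}]"""), arraySchema).as("parsed"))
  .show()
```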


---




[GitHub] spark pull request #22938: [SPARK-25935][SQL] Prevent null rows from JSON pa...

2018-11-07 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22938#discussion_r231745125
  
--- Diff: docs/sql-migration-guide-upgrade.md ---
@@ -15,6 +15,8 @@ displayTitle: Spark SQL Upgrading Guide
 
   - Since Spark 3.0, the `from_json` functions supports two modes - 
`PERMISSIVE` and `FAILFAST`. The modes can be set via the `mode` option. The 
default mode became `PERMISSIVE`. In previous versions, behavior of `from_json` 
did not conform to either `PERMISSIVE` nor `FAILFAST`, especially in processing 
of malformed JSON records. For example, the JSON string `{"a" 1}` with the 
schema `a INT` is converted to `null` by previous versions but Spark 3.0 
converts it to `Row(null)`.
 
+  - In Spark version 2.4 and earlier, JSON data source and the `from_json` 
function produced `null`s if there is no valid root JSON token in its input (` 
` for example). Since Spark 3.0, such input is treated as a bad record and 
handled according to specified mode. For example, in the `PERMISSIVE` mode the 
` ` input is converted to `Row(null, null)` if specified schema is `key STRING, 
value INT`. 
--- End diff --

Just out of curiosity, how can the JSON data source return null rows?


---




[GitHub] spark issue #22958: [SPARK-25952][SQL] Passing actual schema to JacksonParse...

2018-11-07 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22958
  
good catch! do we need to fix the CSV side?


---




[GitHub] spark issue #22818: [SPARK-25904][CORE] Allocate arrays smaller than Int.Max...

2018-11-07 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22818
  
since this is a bug fix, shall we also backport it?


---




[GitHub] spark pull request #22943: [SPARK-25098][SQL] Trim the string when cast stri...

2018-11-06 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22943#discussion_r231382309
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/DateTimeUtilsSuite.scala
 ---
@@ -140,16 +140,10 @@ class DateTimeUtilsSuite extends SparkFunSuite {
 c = Calendar.getInstance()
 c.set(2015, 2, 18, 0, 0, 0)
 c.set(Calendar.MILLISECOND, 0)
-assert(stringToDate(UTF8String.fromString("2015-03-18")).get ===
-  millisToDays(c.getTimeInMillis))
-assert(stringToDate(UTF8String.fromString("2015-03-18 ")).get ===
-  millisToDays(c.getTimeInMillis))
-assert(stringToDate(UTF8String.fromString("2015-03-18 123142")).get ===
-  millisToDays(c.getTimeInMillis))
-assert(stringToDate(UTF8String.fromString("2015-03-18T123123")).get ===
-  millisToDays(c.getTimeInMillis))
-assert(stringToDate(UTF8String.fromString("2015-03-18T")).get ===
-  millisToDays(c.getTimeInMillis))
+Seq("2015-03-18", "2015-03-18 ", " 2015-03-18", " 2015-03-18 ", 
"2015-03-18 123142",
--- End diff --

ah i see


---




[GitHub] spark pull request #22943: [SPARK-25098][SQL] Trim the string when cast stri...

2018-11-06 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22943#discussion_r231380552
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/DateTimeUtilsSuite.scala
 ---
@@ -140,16 +140,10 @@ class DateTimeUtilsSuite extends SparkFunSuite {
 c = Calendar.getInstance()
 c.set(2015, 2, 18, 0, 0, 0)
 c.set(Calendar.MILLISECOND, 0)
-assert(stringToDate(UTF8String.fromString("2015-03-18")).get ===
-  millisToDays(c.getTimeInMillis))
-assert(stringToDate(UTF8String.fromString("2015-03-18 ")).get ===
-  millisToDays(c.getTimeInMillis))
-assert(stringToDate(UTF8String.fromString("2015-03-18 123142")).get ===
-  millisToDays(c.getTimeInMillis))
-assert(stringToDate(UTF8String.fromString("2015-03-18T123123")).get ===
-  millisToDays(c.getTimeInMillis))
-assert(stringToDate(UTF8String.fromString("2015-03-18T")).get ===
-  millisToDays(c.getTimeInMillis))
+Seq("2015-03-18", "2015-03-18 ", " 2015-03-18", " 2015-03-18 ", 
"2015-03-18 123142",
--- End diff --

the test result doesn't change?


---




[GitHub] spark issue #22956: [SPARK-25950][SQL] from_csv should respect to spark.sql....

2018-11-06 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22956
  
LGTM


---




[GitHub] spark pull request #22956: [SPARK-25950][SQL] from_csv should respect to spa...

2018-11-06 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22956#discussion_r231359024
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/csvExpressions.scala
 ---
@@ -92,8 +93,14 @@ case class CsvToStructs(
 }
   }
 
+  val nameOfCorruptRecord = 
SQLConf.get.getConf(SQLConf.COLUMN_NAME_OF_CORRUPT_RECORD)
--- End diff --

should this be private?


---




[GitHub] spark pull request #22944: [SPARK-25942][SQL] Fix Dataset.groupByKey to make...

2018-11-06 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22944#discussion_r231358749
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/object.scala
 ---
@@ -262,25 +262,39 @@ object AppendColumns {
   def apply[T : Encoder, U : Encoder](
   func: T => U,
   child: LogicalPlan): AppendColumns = {
+val outputEncoder = encoderFor[U]
+val namedExpressions = if (!outputEncoder.isSerializedAsStruct) {
+  assert(outputEncoder.namedExpressions.length == 1)
+  outputEncoder.namedExpressions.map(Alias(_, "key")())
+} else {
+  outputEncoder.namedExpressions
+}
 new AppendColumns(
   func.asInstanceOf[Any => Any],
   implicitly[Encoder[T]].clsTag.runtimeClass,
   implicitly[Encoder[T]].schema,
   UnresolvedDeserializer(encoderFor[T].deserializer),
-  encoderFor[U].namedExpressions,
+  namedExpressions,
   child)
   }
 
   def apply[T : Encoder, U : Encoder](
   func: T => U,
   inputAttributes: Seq[Attribute],
   child: LogicalPlan): AppendColumns = {
+val outputEncoder = encoderFor[U]
+val namedExpressions = if (!outputEncoder.isSerializedAsStruct) {
+  assert(outputEncoder.namedExpressions.length == 1)
+  outputEncoder.namedExpressions.map(Alias(_, "key")())
+} else {
+  outputEncoder.namedExpressions
--- End diff --

I wouldn't special-case primitive types when this is a general problem.


---




[GitHub] spark pull request #22944: [SPARK-25942][SQL] Fix Dataset.groupByKey to make...

2018-11-06 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22944#discussion_r231358690
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/object.scala
 ---
@@ -262,25 +262,39 @@ object AppendColumns {
   def apply[T : Encoder, U : Encoder](
   func: T => U,
   child: LogicalPlan): AppendColumns = {
+val outputEncoder = encoderFor[U]
+val namedExpressions = if (!outputEncoder.isSerializedAsStruct) {
+  assert(outputEncoder.namedExpressions.length == 1)
+  outputEncoder.namedExpressions.map(Alias(_, "key")())
+} else {
+  outputEncoder.namedExpressions
+}
 new AppendColumns(
   func.asInstanceOf[Any => Any],
   implicitly[Encoder[T]].clsTag.runtimeClass,
   implicitly[Encoder[T]].schema,
   UnresolvedDeserializer(encoderFor[T].deserializer),
-  encoderFor[U].namedExpressions,
+  namedExpressions,
   child)
   }
 
   def apply[T : Encoder, U : Encoder](
   func: T => U,
   inputAttributes: Seq[Attribute],
   child: LogicalPlan): AppendColumns = {
+val outputEncoder = encoderFor[U]
+val namedExpressions = if (!outputEncoder.isSerializedAsStruct) {
+  assert(outputEncoder.namedExpressions.length == 1)
+  outputEncoder.namedExpressions.map(Alias(_, "key")())
+} else {
+  outputEncoder.namedExpressions
--- End diff --

I wouldn't special-case primitive types when this is a general problem.


---




[GitHub] spark pull request #22944: [SPARK-25942][SQL] Fix Dataset.groupByKey to make...

2018-11-06 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22944#discussion_r231201502
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/object.scala
 ---
@@ -262,25 +262,39 @@ object AppendColumns {
   def apply[T : Encoder, U : Encoder](
   func: T => U,
   child: LogicalPlan): AppendColumns = {
+val outputEncoder = encoderFor[U]
+val namedExpressions = if (!outputEncoder.isSerializedAsStruct) {
+  assert(outputEncoder.namedExpressions.length == 1)
+  outputEncoder.namedExpressions.map(Alias(_, "key")())
+} else {
+  outputEncoder.namedExpressions
+}
 new AppendColumns(
   func.asInstanceOf[Any => Any],
   implicitly[Encoder[T]].clsTag.runtimeClass,
   implicitly[Encoder[T]].schema,
   UnresolvedDeserializer(encoderFor[T].deserializer),
-  encoderFor[U].namedExpressions,
+  namedExpressions,
   child)
   }
 
   def apply[T : Encoder, U : Encoder](
   func: T => U,
   inputAttributes: Seq[Attribute],
   child: LogicalPlan): AppendColumns = {
+val outputEncoder = encoderFor[U]
+val namedExpressions = if (!outputEncoder.isSerializedAsStruct) {
+  assert(outputEncoder.namedExpressions.length == 1)
+  outputEncoder.namedExpressions.map(Alias(_, "key")())
+} else {
+  outputEncoder.namedExpressions
--- End diff --

I mean, maybe we should just leave this problem. I'm not sure how hacky it is to detect `AppendColumns` in this case. Maybe we can be more confident once you have a PR ready.


---




[GitHub] spark issue #22873: [SPARK-25866][ML] Update KMeans formatVersion

2018-11-06 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22873
  
thanks, merging to master/2.4!


---




[GitHub] spark pull request #22944: [SPARK-25942][SQL] Fix Dataset.groupByKey to make...

2018-11-06 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22944#discussion_r231159305
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/object.scala
 ---
@@ -262,25 +262,39 @@ object AppendColumns {
   def apply[T : Encoder, U : Encoder](
   func: T => U,
   child: LogicalPlan): AppendColumns = {
+val outputEncoder = encoderFor[U]
+val namedExpressions = if (!outputEncoder.isSerializedAsStruct) {
+  assert(outputEncoder.namedExpressions.length == 1)
+  outputEncoder.namedExpressions.map(Alias(_, "key")())
+} else {
+  outputEncoder.namedExpressions
+}
 new AppendColumns(
   func.asInstanceOf[Any => Any],
   implicitly[Encoder[T]].clsTag.runtimeClass,
   implicitly[Encoder[T]].schema,
   UnresolvedDeserializer(encoderFor[T].deserializer),
-  encoderFor[U].namedExpressions,
+  namedExpressions,
   child)
   }
 
   def apply[T : Encoder, U : Encoder](
   func: T => U,
   inputAttributes: Seq[Attribute],
   child: LogicalPlan): AppendColumns = {
+val outputEncoder = encoderFor[U]
+val namedExpressions = if (!outputEncoder.isSerializedAsStruct) {
+  assert(outputEncoder.namedExpressions.length == 1)
+  outputEncoder.namedExpressions.map(Alias(_, "key")())
+} else {
+  outputEncoder.namedExpressions
--- End diff --

Ah, I see. Then maybe we should just leave it instead of hacking `AttributeSeq.resolve`.


---




[GitHub] spark pull request #22944: [SPARK-25942][SQL] Fix Dataset.groupByKey to make...

2018-11-06 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22944#discussion_r231129654
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/object.scala
 ---
@@ -262,25 +262,39 @@ object AppendColumns {
   def apply[T : Encoder, U : Encoder](
   func: T => U,
   child: LogicalPlan): AppendColumns = {
+val outputEncoder = encoderFor[U]
+val namedExpressions = if (!outputEncoder.isSerializedAsStruct) {
+  assert(outputEncoder.namedExpressions.length == 1)
+  outputEncoder.namedExpressions.map(Alias(_, "key")())
+} else {
+  outputEncoder.namedExpressions
+}
 new AppendColumns(
   func.asInstanceOf[Any => Any],
   implicitly[Encoder[T]].clsTag.runtimeClass,
   implicitly[Encoder[T]].schema,
   UnresolvedDeserializer(encoderFor[T].deserializer),
-  encoderFor[U].namedExpressions,
+  namedExpressions,
   child)
   }
 
   def apply[T : Encoder, U : Encoder](
   func: T => U,
   inputAttributes: Seq[Attribute],
   child: LogicalPlan): AppendColumns = {
+val outputEncoder = encoderFor[U]
+val namedExpressions = if (!outputEncoder.isSerializedAsStruct) {
+  assert(outputEncoder.namedExpressions.length == 1)
+  outputEncoder.namedExpressions.map(Alias(_, "key")())
+} else {
+  outputEncoder.namedExpressions
--- End diff --

> Should we only fail the groupByKey query accessing ambiguous field names?

Yes. When we have unresolved attributes, check if the child plan is 
`AppendColumns`.


---




[GitHub] spark pull request #22944: [SPARK-25942][SQL] Fix Dataset.groupByKey to make...

2018-11-05 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22944#discussion_r230997935
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/object.scala
 ---
@@ -262,25 +262,39 @@ object AppendColumns {
   def apply[T : Encoder, U : Encoder](
   func: T => U,
   child: LogicalPlan): AppendColumns = {
+val outputEncoder = encoderFor[U]
+val namedExpressions = if (!outputEncoder.isSerializedAsStruct) {
+  assert(outputEncoder.namedExpressions.length == 1)
+  outputEncoder.namedExpressions.map(Alias(_, "key")())
+} else {
+  outputEncoder.namedExpressions
+}
 new AppendColumns(
   func.asInstanceOf[Any => Any],
   implicitly[Encoder[T]].clsTag.runtimeClass,
   implicitly[Encoder[T]].schema,
   UnresolvedDeserializer(encoderFor[T].deserializer),
-  encoderFor[U].namedExpressions,
+  namedExpressions,
   child)
   }
 
   def apply[T : Encoder, U : Encoder](
   func: T => U,
   inputAttributes: Seq[Attribute],
   child: LogicalPlan): AppendColumns = {
+val outputEncoder = encoderFor[U]
+val namedExpressions = if (!outputEncoder.isSerializedAsStruct) {
+  assert(outputEncoder.namedExpressions.length == 1)
+  outputEncoder.namedExpressions.map(Alias(_, "key")())
+} else {
+  outputEncoder.namedExpressions
--- End diff --

Is this a special case of Option of Product? Can you try a primitive type and a product type?



---




[GitHub] spark pull request #22547: [SPARK-25528][SQL] data source V2 read side API r...

2018-11-05 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22547#discussion_r230989559
  
--- Diff: 
external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaContinuousInputStream.scala
 ---
@@ -46,17 +45,22 @@ import org.apache.spark.sql.types.StructType
  *   scenarios, where some offsets after the specified 
initial ones can't be
  *   properly read.
  */
-class KafkaContinuousReadSupport(
+class KafkaContinuousInputStream(
--- End diff --

Yea I'll separate this PR into 3 smaller ones, after we have agreed on the 
high-level design at 
https://docs.google.com/document/d/1uUmKCpWLdh9vHxP7AWJ9EgbwB_U6T3EJYNjhISGmiQg/edit?usp=sharing


---




[GitHub] spark pull request #22944: [SPARK-25942][SQL] Fix Dataset.groupByKey to make...

2018-11-05 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22944#discussion_r230989073
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/object.scala
 ---
@@ -262,25 +262,39 @@ object AppendColumns {
   def apply[T : Encoder, U : Encoder](
   func: T => U,
   child: LogicalPlan): AppendColumns = {
+val outputEncoder = encoderFor[U]
+val namedExpressions = if (!outputEncoder.isSerializedAsStruct) {
+  assert(outputEncoder.namedExpressions.length == 1)
+  outputEncoder.namedExpressions.map(Alias(_, "key")())
+} else {
+  outputEncoder.namedExpressions
+}
 new AppendColumns(
   func.asInstanceOf[Any => Any],
   implicitly[Encoder[T]].clsTag.runtimeClass,
   implicitly[Encoder[T]].schema,
   UnresolvedDeserializer(encoderFor[T].deserializer),
-  encoderFor[U].namedExpressions,
+  namedExpressions,
   child)
   }
 
   def apply[T : Encoder, U : Encoder](
   func: T => U,
   inputAttributes: Seq[Attribute],
   child: LogicalPlan): AppendColumns = {
+val outputEncoder = encoderFor[U]
+val namedExpressions = if (!outputEncoder.isSerializedAsStruct) {
+  assert(outputEncoder.namedExpressions.length == 1)
+  outputEncoder.namedExpressions.map(Alias(_, "key")())
+} else {
+  outputEncoder.namedExpressions
--- End diff --

We can improve the `CheckAnalysis` to detect this case, and improve the 
error message to ask users to do alias.


---




[GitHub] spark pull request #22944: [SPARK-25942][SQL] Fix Dataset.groupByKey to make...

2018-11-05 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22944#discussion_r230986226
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/object.scala
 ---
@@ -262,25 +262,39 @@ object AppendColumns {
   def apply[T : Encoder, U : Encoder](
   func: T => U,
   child: LogicalPlan): AppendColumns = {
+val outputEncoder = encoderFor[U]
+val namedExpressions = if (!outputEncoder.isSerializedAsStruct) {
+  assert(outputEncoder.namedExpressions.length == 1)
+  outputEncoder.namedExpressions.map(Alias(_, "key")())
+} else {
+  outputEncoder.namedExpressions
+}
 new AppendColumns(
   func.asInstanceOf[Any => Any],
   implicitly[Encoder[T]].clsTag.runtimeClass,
   implicitly[Encoder[T]].schema,
   UnresolvedDeserializer(encoderFor[T].deserializer),
-  encoderFor[U].namedExpressions,
+  namedExpressions,
   child)
   }
 
   def apply[T : Encoder, U : Encoder](
   func: T => U,
   inputAttributes: Seq[Attribute],
   child: LogicalPlan): AppendColumns = {
+val outputEncoder = encoderFor[U]
+val namedExpressions = if (!outputEncoder.isSerializedAsStruct) {
+  assert(outputEncoder.namedExpressions.length == 1)
+  outputEncoder.namedExpressions.map(Alias(_, "key")())
+} else {
+  outputEncoder.namedExpressions
--- End diff --

If that is the case, I feel it's better to ask users to resolve the conflict manually by adding an alias.


---




[GitHub] spark pull request #22944: [SPARK-25942][SQL] Fix Dataset.groupByKey to make...

2018-11-05 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22944#discussion_r230977212
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/object.scala
 ---
@@ -262,25 +262,39 @@ object AppendColumns {
   def apply[T : Encoder, U : Encoder](
   func: T => U,
   child: LogicalPlan): AppendColumns = {
+val outputEncoder = encoderFor[U]
+val namedExpressions = if (!outputEncoder.isSerializedAsStruct) {
+  assert(outputEncoder.namedExpressions.length == 1)
+  outputEncoder.namedExpressions.map(Alias(_, "key")())
+} else {
+  outputEncoder.namedExpressions
+}
 new AppendColumns(
   func.asInstanceOf[Any => Any],
   implicitly[Encoder[T]].clsTag.runtimeClass,
   implicitly[Encoder[T]].schema,
   UnresolvedDeserializer(encoderFor[T].deserializer),
-  encoderFor[U].namedExpressions,
+  namedExpressions,
   child)
   }
 
   def apply[T : Encoder, U : Encoder](
   func: T => U,
   inputAttributes: Seq[Attribute],
   child: LogicalPlan): AppendColumns = {
+val outputEncoder = encoderFor[U]
+val namedExpressions = if (!outputEncoder.isSerializedAsStruct) {
+  assert(outputEncoder.namedExpressions.length == 1)
+  outputEncoder.namedExpressions.map(Alias(_, "key")())
+} else {
+  outputEncoder.namedExpressions
--- End diff --

So we may still fail if `T` and `U` are case classes that have conflicting field names?
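For reference, a sketch of the conflict being asked about (hypothetical case classes, spark-shell assumed): both the input type and the key type expose a field named `value`, so the appended key columns clash with the data columns by name.

```scala
import spark.implicits._

case class In(value: String, n: Int)
case class Key(value: String)

val ds = Seq(In("a", 1), In("a", 2), In("b", 3)).toDS()

// groupByKey appends the serialized Key columns next to In's columns,
// leaving two columns named "value" for downstream resolution.
val grouped = ds.groupByKey(in => Key(in.value)).count()
```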


---




[GitHub] spark issue #22928: [SPARK-25926][CORE] Move config entries in core module t...

2018-11-05 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22928
  
thanks, merging to master!


---




[GitHub] spark issue #22946: [SPARK-25943][SQL] Fail if mismatching nested struct fie...

2018-11-05 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22946
  
ok to test


---




[GitHub] spark issue #22946: [SPARK-25943][SQL] Fail if mismatching nested struct fie...

2018-11-05 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22946
  
ah good catch! Can you also add a test?


---




[GitHub] spark issue #22502: [SPARK-25474][SQL]When the "fallBackToHdfsForStats= true...

2018-11-05 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22502
  
@shahidki31 thanks for fixing it!

Do you know where we currently read `fallBackToHdfsForStats`? Let's see if we can have a unified place to do it.
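For context, the config being referred to (key name as defined in SQLConf; a brief usage sketch, assuming a spark-shell session):

```scala
// With this enabled, a table whose size is missing from the catalog can fall
// back to the size of its files on HDFS when the planner needs an estimate,
// e.g. for broadcast-join decisions.
spark.conf.set("spark.sql.statistics.fallBackToHdfs", "true")
```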


---




[GitHub] spark issue #22889: [SPARK-25882][SQL] Added a function to join two datasets...

2018-11-05 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22889
  
I think the problem is real. Maybe we should not use `Seq` in the end-user API, but always use `Array` to be more Java-friendly. This can also avoid bugs like https://github.com/apache/spark/pull/22789

cc @rxin @hvanhovell what do you think?


---




[GitHub] spark issue #22949: [minor] update known_translations

2018-11-05 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22949
  
Note that these updates were generated by the script, not by me. If someone is not in the list, it means the script can figure out the full name without a translation.


---




[GitHub] spark pull request #22949: [minor] update known_translations

2018-11-05 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22949#discussion_r230970453
  
--- Diff: dev/create-release/known_translations ---
@@ -203,3 +203,61 @@ shenh062326 - Shen Hong
 aokolnychyi - Anton Okolnychyi
 linbojin - Linbo Jin
 lw-lin - Liwei Lin
+10110346 - Xian Liu
+Achuth17 - Achuth Narayan Rajagopal
+Adamyuanyuan - Adam Wang
+DylanGuedes - Dylan Guedes
+JiahuiJiang - Jiahui Jiang
+KevinZwx - Kevin Zhang
+LantaoJin - Lantao Jin
+Lemonjing - Rann Tao
+LucaCanali - Luca Canali
+XD-DENG - Xiaodong Deng
+aai95 - Aleksei Izmalkin
+akonopko - Alexander Konopko
+ankuriitg - Ankur Gupta
+arucard21 - Riaas Mokiem
+attilapiros - Attila Zsolt Piros
+bravo-zhang - Bravo Zhang
+caneGuy - Kang Zhou
+chaoslawful - Xiaozhe Wang
+cluo512 - Chuan Luo
+codeatri - Neha Patil
+crafty-coder - Carlos Pena
+debugger87 - Chaozhong Yang
+e-dorigatti - Emilio Dorigatti
+eric-maynard - Eric Maynard
+felixalbani - Felix Albani
+fjh100456 - fjh100456
--- End diff --

ah I missed this one.


---




[GitHub] spark issue #22949: [minor] update known_translations

2018-11-05 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22949
  
cc @gatorsmile


---




[GitHub] spark pull request #22949: [minor] update known_translations

2018-11-05 Thread cloud-fan
GitHub user cloud-fan opened a pull request:

https://github.com/apache/spark/pull/22949

[minor] update known_translations

## What changes were proposed in this pull request?

update known_translations after running `translate-contributors.py` during 
2.4.0 release

## How was this patch tested?

N/A

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/cloud-fan/spark contributors

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22949.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22949


commit 6cd37374fc1917fe4c590304183521e9de3c4d23
Author: Wenchen Fan 
Date:   2018-11-05T14:50:52Z

update known_translations




---




[GitHub] spark pull request #22944: [SPARK-25942][SQL] Fix Dataset.groupByKey to make...

2018-11-05 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22944#discussion_r230772018
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala 
---
@@ -1556,6 +1556,14 @@ class DatasetSuite extends QueryTest with 
SharedSQLContext {
   df.where($"city".contains(new java.lang.Character('A'))),
   Seq(Row("Amsterdam")))
   }
+
+  test("SPARK-25942: typed aggregation on primitive data") {
--- End diff --

how was this bug introduced?
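For context, a minimal sketch of the scenario this test exercises (spark-shell assumed; hypothetical data): a typed aggregation keyed on a primitive value, where the patch above aliases the appended key column to `key` instead of letting it reuse the default `value` name.

```scala
import spark.implicits._

// A Dataset of primitives exposes a single column named "value";
// the grouping key is derived from that same primitive.
val ds = Seq(1, 2, 3, 1).toDS()
val counts = ds.groupByKey(x => x).count()
counts.show()
```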


---




[GitHub] spark pull request #22771: [SPARK-25773][Core]Cancel zombie tasks in a resul...

2018-11-05 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22771#discussion_r230730694
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala 
---
@@ -1364,6 +1385,21 @@ private[spark] class DAGScheduler(
   if (job.numFinished == job.numPartitions) {
 markStageAsFinished(resultStage)
 cleanupStateForJobAndIndependentStages(job)
+try {
+  // killAllTaskAttempts will fail if a 
SchedulerBackend does not implement
+  // killTask.
+  logInfo(s"Job ${job.jobId} is finished. Cancelling 
potential speculative " +
+"or zombie tasks for this job")
+  // ResultStage is only used by this job. It's safe 
to kill speculative or
+  // zombie tasks in this stage.
+  taskScheduler.killAllTaskAttempts(
--- End diff --

cc @jiangxb1987 IIRC we have some similar code in barrier execution. Shall 
we create a util method to safely kill tasks?
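A rough sketch of what such a shared helper might look like (hypothetical name and placement inside the scheduler package; it assumes the `TaskScheduler.killAllTaskAttempts(stageId, interruptThread, reason)` call from the diff above and a `Logging`-style `logWarning` in scope):

```scala
// Wraps the kill call so each caller does not have to handle scheduler
// backends that do not implement killTask.
def killAllTaskAttemptsSafely(
    taskScheduler: TaskScheduler,
    stageId: Int,
    interruptThread: Boolean,
    reason: String): Unit = {
  try {
    taskScheduler.killAllTaskAttempts(stageId, interruptThread, reason)
  } catch {
    case e: UnsupportedOperationException =>
      // The backend cannot kill tasks; log and continue instead of failing the job.
      logWarning(s"Could not kill all task attempts of stage $stageId", e)
  }
}
```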


---




[GitHub] spark issue #22847: [SPARK-25850][SQL] Make the split threshold for the code...

2018-11-05 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22847
  
thanks, merging to master!


---




[GitHub] spark issue #21732: [SPARK-24762][SQL] Enable Option of Product encoders

2018-11-05 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/21732
  
I think this is close, can you answer 
https://github.com/apache/spark/pull/21732/files#r228782670 ?


---




[GitHub] spark pull request #21732: [SPARK-24762][SQL] Enable Option of Product encod...

2018-11-05 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21732#discussion_r230726457
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala 
---
@@ -1556,6 +1547,54 @@ class DatasetSuite extends QueryTest with 
SharedSQLContext {
   df.where($"city".contains(new java.lang.Character('A'))),
   Seq(Row("Amsterdam")))
   }
+
+  test("SPARK-24762: Enable top-level Option of Product encoders") {
+val data = Seq(Some((1, "a")), Some((2, "b")), None)
+val ds = data.toDS()
+
+checkDataset(
+  ds,
+  data: _*)
+
+val schema = StructType(Seq(
--- End diff --

can we use the `add` API? e.g.
```
new StructType().add(
  "value",
  new StructType()
.add("_1", ...)
.add("_2", ...))
```


---




[GitHub] spark pull request #21732: [SPARK-24762][SQL] Enable Option of Product encod...

2018-11-05 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21732#discussion_r230726136
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/DatasetAggregatorSuite.scala ---
@@ -393,4 +431,18 @@ class DatasetAggregatorSuite extends QueryTest with 
SharedSQLContext {
 assert(grouped.schema == df.schema)
 checkDataset(grouped.as[OptionBooleanData], OptionBooleanData("bob", 
Some(true)))
   }
+
+  test("SPARK-24762: Aggregator should be able to use Option of Product 
encoder") {
+val df = Seq(
+  OptionBooleanIntData("bob", Some((true, 1))),
+  OptionBooleanIntData("bob", Some((false, 2))),
+  OptionBooleanIntData("bob", None)).toDF()
+
+val group = df
+  .groupBy("name")
+  .agg(OptionBooleanIntAggregator("isGood").toColumn.alias("isGood"))
+assert(df.schema == group.schema)
--- End diff --

let's write down the expected schema
```
val expectedSchema = ...
assert(df.schema == expectedSchema)
assert(grouped.schema == ...)
```


---




[GitHub] spark issue #22889: [SPARK-25882][SQL] Added a function to join two datasets...

2018-11-05 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22889
  
So we would introduce a new API just to save typing `Seq(...)`? Maintaining an API has a cost.


---




[GitHub] spark issue #22928: [SPARK-25926][CORE] Move config entries in core module t...

2018-11-05 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22928
  
retest this please


---




[GitHub] spark issue #22928: [SPARK-25926][CORE] Move config entries in core module t...

2018-11-05 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22928
  
LGTM


---




[GitHub] spark issue #22923: [SPARK-25910][CORE] accumulator updates from previous st...

2018-11-05 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22923
  
> We need to always update user accumulators

Ah that's a good point.

I'm going to close it. I missed one thing: the `AppStatusListener` will keep the `StageInfo` instance until all tasks of that stage attempt are finished. See https://github.com/apache/spark/pull/22209

So this bug should already have been fixed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22923: [SPARK-25910][CORE] accumulator updates from prev...

2018-11-05 Thread cloud-fan
Github user cloud-fan closed the pull request at:

https://github.com/apache/spark/pull/22923


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22764: [SPARK-25765][ML] Add training cost to BisectingKMeans s...

2018-11-05 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22764
  
cc @dbtsai 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22786: [SPARK-25764][ML][EXAMPLES] Update BisectingKMeans examp...

2018-11-05 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22786
  
cc @dbtsai 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22869: [SPARK-25758][ML] Deprecate computeCost in BisectingKMea...

2018-11-05 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22869
  
cc @dbtsai 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22919: [SPARK-25906][SHELL] Documents '-I' option (from ...

2018-11-05 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22919#discussion_r230686892
  
--- Diff: bin/spark-shell ---
@@ -32,7 +32,10 @@ if [ -z "${SPARK_HOME}" ]; then
   source "$(dirname "$0")"/find-spark-home
 fi
 
-export _SPARK_CMD_USAGE="Usage: ./bin/spark-shell [options]"
+export _SPARK_CMD_USAGE="Usage: ./bin/spark-shell [options]
+
+Scala REPL options:
+  -I <file>                   preload <file>, enforcing line-by-line interpretation"
--- End diff --

I mean, I didn't find
```
Options:
  --master MASTER_URL spark://host:port, mesos://host:port, yarn,
  k8s://https://host:port, or local (Default: 
local[*]).
  --deploy-mode DEPLOY_MODE   Whether to launch the driver program locally 
("client") or
  on one of the worker machines inside the 
cluster ("cluster")
  (Default: client).
```

in the shell script. Where do we define them?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22942: [SPARK-25884][SQL][FOLLOW-UP] Add sample.json back.

2018-11-04 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22942
  
thanks, merging to master!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22919: [SPARK-25906][SHELL] Documents '-I' option (from ...

2018-11-04 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22919#discussion_r230655513
  
--- Diff: bin/spark-shell ---
@@ -32,7 +32,10 @@ if [ -z "${SPARK_HOME}" ]; then
   source "$(dirname "$0")"/find-spark-home
 fi
 
-export _SPARK_CMD_USAGE="Usage: ./bin/spark-shell [options]"
+export _SPARK_CMD_USAGE="Usage: ./bin/spark-shell [options]
+
+Scala REPL options:
+  -I <file>                   preload <file>, enforcing line-by-line interpretation"
--- End diff --

where do we define other options?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22927: [SPARK-25918][SQL] LOAD DATA LOCAL INPATH should handle ...

2018-11-02 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22927
  
I'll list it as a known issue in 2.4.0, thanks for fixing it!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22847: [SPARK-25850][SQL] Make the split threshold for the code...

2018-11-01 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22847
  
did you address 
https://github.com/apache/spark/pull/22847#issuecomment-434836278 ?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22029: [SPARK-24395][SQL] IN operator should return NULL when c...

2018-11-01 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22029
  
If we decide to follow PostgreSQL about the EQUAL behavior eventually, then 
it will be much easier to fix the IN behavior, right?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22923: [SPARK-25910][CORE] accumulator updates from previous st...

2018-11-01 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22923
  
cc @vanzin @zsxwing @jiangxb1987 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22923: [SPARK-25910][CORE] accumulator updates from prev...

2018-11-01 Thread cloud-fan
GitHub user cloud-fan opened a pull request:

https://github.com/apache/spark/pull/22923

[SPARK-25910][CORE] accumulator updates from previous stage attempt should 
not log error

## What changes were proposed in this pull request?

For shuffle map stages, we may have multiple attempts, while only the 
latest attempt is active. However, the scheduler still accepts successful tasks 
from previous attempts, to speed up the execution.

Each stage attempt has a `StageInfo` instance, which contains 
`TaskMetrics`. `TaskMetrics` has a bunch of accumulators to track the metrics 
like CPU time, etc. However, a stage only keeps the `StageInfo` of the latest 
attempt, which means the `StageInfo` of previous attempts will be GCed, and 
their accumulators of `TaskMetrics` will be cleaned.

This causes a problem: When the scheduler accepts a successful task from a 
previous attempt, and tries to update accumulators, we may fail to get the 
accumulators from `AccumulatorContext`, as they are already cleaned. And we may 
hit error log like
```
18/10/21 15:30:24 INFO ContextCleaner: Cleaned accumulator 2868 (name: 
internal.metrics.executorDeserializeTime)
18/10/21 15:30:24 ERROR DAGScheduler: Failed to update accumulators for 
task 7927
org.apache.spark.SparkException: attempted to access non-existent 
accumulator 2868
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1267)
...
```

This PR proposes a simple fix: When the scheduler receives successful tasks 
from previous attempts, don't update accumulators. Accumulators of previous 
stage attempts are not tracked anymore, so we don't need to update them.
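
A minimal, self-contained sketch of that guard (hypothetical names; the real 
change lives around `DAGScheduler.handleTaskCompletion` and may differ in 
detail):
```
// Toy model only: the scheduler still accepts the late task, but skips the
// accumulator/metrics update when the task belongs to an earlier (stale) stage
// attempt, because that attempt's TaskMetrics accumulators were already cleaned.
final case class TaskEnd(stageId: Int, stageAttemptId: Int, accumUpdates: Seq[Long])
final class StageState(var latestAttemptNumber: Int, var metricTotal: Long = 0L)

def onTaskCompleted(event: TaskEnd, stage: StageState): Unit = {
  if (event.stageAttemptId == stage.latestAttemptNumber) {
    stage.metricTotal += event.accumUpdates.sum
  }
  // else: drop the update silently instead of logging an error.
}
```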


## How was this patch tested?

a new test.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/cloud-fan/spark late-task

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22923.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22923


commit 07f900cf845662186f8d1daea3be9abe2633d5c0
Author: Wenchen Fan 
Date:   2018-11-01T15:40:14Z

accumulator updates from previous stage attempt

commit 4d9cbe043604e76b6367e4ecb42d0d36437d1792
Author: Wenchen Fan 
Date:   2018-11-01T16:04:41Z

different fix




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22029: [SPARK-24395][SQL] IN operator should return NULL when c...

2018-11-01 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22029
  
If we want to follow PostgreSQL/Oracle for the IN behavior, why don't we 
follow the EQUAL behavior as well?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22029: [SPARK-24395][SQL] IN operator should return NULL when c...

2018-11-01 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22029
  
Another point: I think it's also important to make the behavior of IN 
consistent with EQUAL. I tried PostgreSQL and `(1, 2) = (3, null)` returns null.

Shall we update EQUAL first? The behavior of IN will be updated accordingly 
after we update `EQUAL`.
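
For the record, a quick way to compare Spark's current EQUAL behavior with the 
PostgreSQL result above from a `spark-shell` session (the output is deliberately 
not asserted here, since that is exactly what is under discussion):
```
// Structs stand in for the multi-column tuple; the CAST keeps both structs the
// same type so the comparison resolves.
spark.sql(
  "SELECT named_struct('c1', 1, 'c2', 2) = named_struct('c1', 3, 'c2', CAST(NULL AS INT))"
).show()
```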


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22919: [SPARK-25906][SHELL] Restores '-i' option's behaviour in...

2018-11-01 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22919
  
personally I think Spark Shell should be consistent with the upstream Scala 
Shell, otherwise we may get another ticket complaining about why we didn't follow it...


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22919: [SPARK-25906][SHELL] Restores '-i' option's behaviour in...

2018-11-01 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22919
  
so we would support both `-i` and `-I` in 2.4?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22029: [SPARK-24395][SQL] IN operator should return NULL when c...

2018-11-01 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22029
  
which Presto version did you test? I tried 0.203 and it fails
```
presto:default> select * from t2 where (1, 2) in (select x, y from t);
Query 20181101_085707_00012_n644a failed: line 1:31: Multiple columns 
returned by subquery are not yet supported. Found 2
```


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22029: [SPARK-24395][SQL] IN operator should return NULL when c...

2018-11-01 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22029
  
Do you know how Presto supports multi-value IN subqueries? From reading the PR 
description, it seems impossible if Presto treats `(a, b)` as a struct value. 
How does Presto distinguish `(a, b) IN (select x,y ...)` from `struct_col IN 
(select x,y ...)`?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22898: [SPARK-25746][SQL][followup] do not add unnecessary If e...

2018-10-31 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22898
  
thanks, merging to master!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22626: [SPARK-25638][SQL] Adding new function - to_csv()

2018-10-31 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22626
  
This needs to be rebased.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22892: [SPARK-25884][SQL] Add TBLPROPERTIES and COMMENT, and us...

2018-10-31 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22892
  
thanks, merging to master!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22847: [SPARK-25850][SQL] Make the split threshold for the code...

2018-10-31 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22847
  
@rednaxelafx ah good point! It's hardcoded as 1024 too, and it's also doing 
method splitting. Let's apply the config there too.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22029: [SPARK-24395][SQL] IN operator should return NULL...

2018-10-31 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22029#discussion_r229788060
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala
 ---
@@ -212,27 +212,27 @@ object ReorderAssociativeOperator extends 
Rule[LogicalPlan] {
  * 1. Converts the predicate to false when the list is empty and
  *the value is not nullable.
  * 2. Removes literal repetitions.
- * 3. Replaces [[In (value, seq[Literal])]] with optimized version
+ * 3. Replaces [[In (values, seq[Literal])]] with optimized version
  *[[InSet (value, HashSet[Literal])]] which is much faster.
  */
 object OptimizeIn extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transform {
 case q: LogicalPlan => q transformExpressionsDown {
-  case In(v, list) if list.isEmpty =>
+  case i @ In(_, list) if list.isEmpty =>
 // When v is not nullable, the following expression will be 
optimized
 // to FalseLiteral which is tested in OptimizeInSuite.scala
-If(IsNotNull(v), FalseLiteral, Literal(null, BooleanType))
-  case expr @ In(v, list) if expr.inSetConvertible =>
+If(IsNotNull(i.value), FalseLiteral, Literal(null, BooleanType))
+  case expr @ In(_, list) if expr.inSetConvertible =>
 val newList = ExpressionSet(list).toSeq
 if (newList.length == 1
   // TODO: `EqualTo` for structural types are not working. Until 
SPARK-24443 is addressed,
   // TODO: we exclude them in this rule.
-  && !v.isInstanceOf[CreateNamedStructLike]
+  && !expr.value.isInstanceOf[CreateNamedStructLike]
--- End diff --

well, I think for this case we should optimize it.

Anyway, it follows the previous behavior; we can change it later.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22029: [SPARK-24395][SQL] IN operator should return NULL...

2018-10-31 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22029#discussion_r22978
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala
 ---
@@ -212,27 +212,27 @@ object ReorderAssociativeOperator extends 
Rule[LogicalPlan] {
  * 1. Converts the predicate to false when the list is empty and
  *the value is not nullable.
  * 2. Removes literal repetitions.
- * 3. Replaces [[In (value, seq[Literal])]] with optimized version
+ * 3. Replaces [[In (values, seq[Literal])]] with optimized version
  *[[InSet (value, HashSet[Literal])]] which is much faster.
  */
 object OptimizeIn extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transform {
 case q: LogicalPlan => q transformExpressionsDown {
-  case In(v, list) if list.isEmpty =>
+  case i @ In(_, list) if list.isEmpty =>
 // When v is not nullable, the following expression will be 
optimized
 // to FalseLiteral which is tested in OptimizeInSuite.scala
-If(IsNotNull(v), FalseLiteral, Literal(null, BooleanType))
-  case expr @ In(v, list) if expr.inSetConvertible =>
+If(IsNotNull(i.value), FalseLiteral, Literal(null, BooleanType))
+  case expr @ In(_, list) if expr.inSetConvertible =>
 val newList = ExpressionSet(list).toSeq
 if (newList.length == 1
   // TODO: `EqualTo` for structural types are not working. Until 
SPARK-24443 is addressed,
   // TODO: we exclude them in this rule.
-  && !v.isInstanceOf[CreateNamedStructLike]
+  && !expr.value.isInstanceOf[CreateNamedStructLike]
--- End diff --

for your case, it's not `CreateNamedStructLike`, but just a struct type 
column?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22905: [SPARK-25894][SQL] Add a ColumnarFileFormat type ...

2018-10-31 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22905#discussion_r229754853
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala 
---
@@ -306,7 +306,15 @@ case class FileSourceScanExec(
   withOptPartitionCount
 }
 
-withSelectedBucketsCount
+val withOptColumnCount = relation.fileFormat match {
+  case columnar: ColumnarFileFormat =>
+val sqlConf = relation.sparkSession.sessionState.conf
+val columnCount = columnar.columnCountForSchema(sqlConf, 
requiredSchema)
+withSelectedBucketsCount + ("ColumnCount" -> columnCount.toString)
--- End diff --

shall we only include this info when the columnar reader is on?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22907: [SPARK-25896][CORE][WIP] Accumulator should only ...

2018-10-31 Thread cloud-fan
GitHub user cloud-fan opened a pull request:

https://github.com/apache/spark/pull/22907

[SPARK-25896][CORE][WIP] Accumulator should only be updated once for each 
successful task in shuffle map stage

## What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/19877

For the same reason, we should also update accumulators only once for each 
successful task in a shuffle map stage.

TODO:
1. `ShuffleMapStage` has `pendingPartitions` and `findMissingPartitions`, 
I'm not sure which one is the single source of truth
2. When we receive repeated successful shuffle map tasks, it seems we will 
override the previous one. Needs a double check.
3. add tests.

## How was this patch tested?

TODO

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/cloud-fan/spark accum

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22907.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22907


commit 7876d9c44c15fe1fb64f1d7587c97e23ff2be5a2
Author: Wenchen Fan 
Date:   2018-10-31T15:23:48Z

Accumulator should only be updated once for each successful task in shuffle 
map stage




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22029: [SPARK-24395][SQL] IN operator should return NULL...

2018-10-31 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22029#discussion_r229708259
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala
 ---
@@ -212,27 +212,27 @@ object ReorderAssociativeOperator extends 
Rule[LogicalPlan] {
  * 1. Converts the predicate to false when the list is empty and
  *the value is not nullable.
  * 2. Removes literal repetitions.
- * 3. Replaces [[In (value, seq[Literal])]] with optimized version
+ * 3. Replaces [[In (values, seq[Literal])]] with optimized version
  *[[InSet (value, HashSet[Literal])]] which is much faster.
  */
 object OptimizeIn extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transform {
 case q: LogicalPlan => q transformExpressionsDown {
-  case In(v, list) if list.isEmpty =>
+  case i @ In(_, list) if list.isEmpty =>
 // When v is not nullable, the following expression will be 
optimized
 // to FalseLiteral which is tested in OptimizeInSuite.scala
-If(IsNotNull(v), FalseLiteral, Literal(null, BooleanType))
-  case expr @ In(v, list) if expr.inSetConvertible =>
+If(IsNotNull(i.value), FalseLiteral, Literal(null, BooleanType))
+  case expr @ In(_, list) if expr.inSetConvertible =>
 val newList = ExpressionSet(list).toSeq
 if (newList.length == 1
   // TODO: `EqualTo` for structural types are not working. Until 
SPARK-24443 is addressed,
   // TODO: we exclude them in this rule.
-  && !v.isInstanceOf[CreateNamedStructLike]
+  && !expr.value.isInstanceOf[CreateNamedStructLike]
--- End diff --

IIUC, you mean
`expr.values.length > 1` => `expr.value.isInstanceOf[CreateNamedStructLike]`
but `expr.value.isInstanceOf[CreateNamedStructLike]` can't => 
`expr.values.length > 1`

Can you give an example?

Based on my understanding, the code here is trying to optimize the case where 
it's not a multi-value IN and the list has only one element.
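
Roughly, that single-element case amounts to the following sketch (assumed 
Catalyst imports; simplified, not the full `OptimizeIn` rule, and written 
against the current `In(value, list)` shape):
```
import org.apache.spark.sql.catalyst.expressions.{EqualTo, Expression, In}

// With exactly one element in the list, the IN collapses to a plain equality,
// e.g. a IN (1) becomes a = 1. That is why struct-producing values are excluded
// here until EqualTo works for them (SPARK-24443).
def rewriteSingleElementIn(e: Expression): Expression = e match {
  case In(v, Seq(single)) => EqualTo(v, single)
  case other => other
}
```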


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22905: [SPARK-25894][SQL] Add a ColumnarFileFormat type which r...

2018-10-31 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22905
  
is there anything blocked by this? I agree this is a good feature, but it 
asks the data source to provide a new ability, which may become a problem when 
migrating file sources to data source v2.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22029: [SPARK-24395][SQL] IN operator should return NULL...

2018-10-31 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22029#discussion_r229701584
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala
 ---
@@ -212,27 +212,27 @@ object ReorderAssociativeOperator extends 
Rule[LogicalPlan] {
  * 1. Converts the predicate to false when the list is empty and
  *the value is not nullable.
  * 2. Removes literal repetitions.
- * 3. Replaces [[In (value, seq[Literal])]] with optimized version
+ * 3. Replaces [[In (values, seq[Literal])]] with optimized version
  *[[InSet (value, HashSet[Literal])]] which is much faster.
  */
 object OptimizeIn extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transform {
 case q: LogicalPlan => q transformExpressionsDown {
-  case In(v, list) if list.isEmpty =>
+  case i @ In(_, list) if list.isEmpty =>
 // When v is not nullable, the following expression will be 
optimized
 // to FalseLiteral which is tested in OptimizeInSuite.scala
-If(IsNotNull(v), FalseLiteral, Literal(null, BooleanType))
-  case expr @ In(v, list) if expr.inSetConvertible =>
+If(IsNotNull(i.value), FalseLiteral, Literal(null, BooleanType))
+  case expr @ In(_, list) if expr.inSetConvertible =>
 val newList = ExpressionSet(list).toSeq
 if (newList.length == 1
   // TODO: `EqualTo` for structural types are not working. Until 
SPARK-24443 is addressed,
   // TODO: we exclude them in this rule.
-  && !v.isInstanceOf[CreateNamedStructLike]
+  && !expr.value.isInstanceOf[CreateNamedStructLike]
--- End diff --

shall we use `expr.values.length == 1` here to make it more clear?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22029: [SPARK-24395][SQL] IN operator should return NULL...

2018-10-31 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22029#discussion_r229700708
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala
 ---
@@ -212,27 +212,34 @@ object ReorderAssociativeOperator extends 
Rule[LogicalPlan] {
  * 1. Converts the predicate to false when the list is empty and
  *the value is not nullable.
  * 2. Removes literal repetitions.
- * 3. Replaces [[In (value, seq[Literal])]] with optimized version
- *[[InSet (value, HashSet[Literal])]] which is much faster.
+ * 3. Replaces [[In (values, seq[Literal])]] with optimized version
+ *[[InSet (values, HashSet[Literal])]] which is much faster.
  */
 object OptimizeIn extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transform {
 case q: LogicalPlan => q transformExpressionsDown {
-  case In(v, list) if list.isEmpty =>
-// When v is not nullable, the following expression will be 
optimized
+  case i @ In(values, list) if list.isEmpty =>
+// When values are not nullable, the following expression will be 
optimized
 // to FalseLiteral which is tested in OptimizeInSuite.scala
-If(IsNotNull(v), FalseLiteral, Literal(null, BooleanType))
-  case expr @ In(v, list) if expr.inSetConvertible =>
+val isNotNull = if (SQLConf.get.inFalseForNullField) {
+  IsNotNull(i.value)
+} else {
+  val valuesNotNull: Seq[Expression] = values.map(IsNotNull)
+  valuesNotNull.tail.foldLeft(valuesNotNull.head)(And)
--- End diff --

nit: `values.map(IsNotNull).reduce(And)`
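
For what it's worth, a small sketch (assuming the usual Catalyst test DSL 
imports) showing the two forms build the same left-nested `And` tree:
```
import org.apache.spark.sql.catalyst.dsl.expressions._
import org.apache.spark.sql.catalyst.expressions.{And, Expression, IsNotNull}

val values: Seq[Expression] = Seq('a.int, 'b.int, 'c.int)
val valuesNotNull: Seq[Expression] = values.map(IsNotNull)
// Both left-associate to And(And(IsNotNull(a), IsNotNull(b)), IsNotNull(c)).
val viaFold = valuesNotNull.tail.foldLeft(valuesNotNull.head)(And)
val viaReduce = valuesNotNull.reduce(And)
assert(viaFold == viaReduce)
```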


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22029: [SPARK-24395][SQL] IN operator should return NULL...

2018-10-31 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22029#discussion_r229699828
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala
 ---
@@ -339,37 +371,57 @@ case class In(value: Expression, list: 
Seq[Expression]) extends Predicate {
  * Optimized version of In clause, when all filter values of In clause are
  * static.
  */
-case class InSet(child: Expression, hset: Set[Any]) extends 
UnaryExpression with Predicate {
+case class InSet(values: Seq[Expression], hset: Set[Any]) extends InBase {
 
   require(hset != null, "hset could not be null")
 
-  override def toString: String = s"$child INSET ${hset.mkString("(", ",", 
")")}"
+  override def toString: String = s"$value INSET ${hset.mkString("(", ",", 
")")}"
 
-  @transient private[this] lazy val hasNull: Boolean = hset.contains(null)
+  override def children: Seq[Expression] = values
 
-  override def nullable: Boolean = child.nullable || hasNull
+  @transient private[this] lazy val hasNull: Boolean = {
+if (isMultiValued && !SQLConf.get.inFalseForNullField) {
+  hset.exists(checkNullEval)
+} else {
+  hset.contains(null)
+}
+  }
 
-  protected override def nullSafeEval(value: Any): Any = {
-if (set.contains(value)) {
-  true
-} else if (hasNull) {
+  override def nullable: Boolean = {
+val isValueNullable = if (isMultiValued && 
!SQLConf.get.inFalseForNullField) {
+  values.exists(_.nullable)
+} else {
+  value.nullable
+}
+isValueNullable || hasNull
+  }
+
+  override def eval(input: InternalRow): Any = {
+val inputValue = value.eval(input)
+if (checkNullEval(inputValue)) {
--- End diff --

do we change behavior here? It seems `null inset (null, xxx)` returned true 
previously.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22029: [SPARK-24395][SQL] IN operator should return NULL...

2018-10-31 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22029#discussion_r229697077
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -1561,6 +1561,16 @@ object SQLConf {
   .booleanConf
   .createWithDefault(false)
 
+  val LEGACY_IN_FALSE_FOR_NULL_FIELD =
+buildConf("spark.sql.legacy.inOperator.falseForNullField")
+  .internal()
+  .doc("When set to true (default), the IN operator returns false when 
comparing multiple " +
+"values containing a null. When set to false, it returns null, 
instead. This is " +
+"important especially when using NOT IN as in the second case, it 
filters out the rows " +
+"when a null is present in a field; while in the first one, those 
rows are returned.")
+  .booleanConf
+  .createWithDefault(true)
--- End diff --

shall we set `false` as the default, to follow the SQL standard and be 
consistent with in-subquery?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22029: [SPARK-24395][SQL] IN operator should return NULL...

2018-10-31 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22029#discussion_r229692081
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala
 ---
@@ -202,7 +225,11 @@ case class InSubquery(values: Seq[Expression], query: 
ListQuery)
  */
 // scalastyle:off line.size.limit
 @ExpressionDescription(
-  usage = "expr1 _FUNC_(expr2, expr3, ...) - Returns true if `expr` equals 
to any valN.",
+  usage = """
+expr1 _FUNC_(expr2, expr3, ...) - Returns true if `expr` equals to any 
valN. Otherwise, if
+  spark.sql.legacy.inOperator.falseForNullField is false and any of 
the elements or fields of
--- End diff --

`any of the elements or fields ...`

We should explicitly mention multi-column IN, which is different from `a in 
(b, c, ...)` while `a` is struct type.
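
For example (`t(a, b, struct_col)` and `s(x, y)` are made-up tables here, just 
to illustrate the two shapes):
```
// Multi-column IN: a row-value constructor on the left, matched against a
// multi-column subquery.
spark.sql("SELECT * FROM t WHERE (a, b) IN (SELECT x, y FROM s)")
// Single struct-typed column: one value compared against struct values.
spark.sql("SELECT * FROM t WHERE struct_col IN (named_struct('a', 1, 'b', 2))")
```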


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22892: [SPARK-25884][SQL] Add TBLPROPERTIES and COMMENT, and us...

2018-10-31 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22892
  
LGTM except some minor comments


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22892: [SPARK-25884][SQL] Add TBLPROPERTIES and COMMENT,...

2018-10-31 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22892#discussion_r229672667
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveShowCreateTableSuite.scala
 ---
@@ -0,0 +1,198 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive
+
+import org.apache.spark.sql.{AnalysisException, ShowCreateTableSuite}
+import org.apache.spark.sql.hive.test.TestHiveSingleton
+
+class HiveShowCreateTableSuite extends ShowCreateTableSuite with 
TestHiveSingleton {
+
+  test("simple hive table") {
+withTable("t1") {
+  sql(
+s"""CREATE TABLE t1 (
+   |  c1 INT COMMENT 'bla',
+   |  c2 STRING
+   |)
+   |TBLPROPERTIES (
+   |  'prop1' = 'value1',
+   |  'prop2' = 'value2'
+   |)
+ """.stripMargin
+  )
+
+  checkCreateTable("t1")
+}
+  }
+
+  test("simple external hive table") {
+withTempDir { dir =>
+  withTable("t1") {
+sql(
+  s"""CREATE TABLE t1 (
+ |  c1 INT COMMENT 'bla',
+ |  c2 STRING
+ |)
+ |LOCATION '${dir.toURI}'
+ |TBLPROPERTIES (
+ |  'prop1' = 'value1',
+ |  'prop2' = 'value2'
+ |)
+   """.stripMargin
+)
+
+checkCreateTable("t1")
+  }
+}
+  }
+
+  test("partitioned hive table") {
--- End diff --

do we have tests for partitioned/bucketed data source tables?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22892: [SPARK-25884][SQL] Add TBLPROPERTIES and COMMENT,...

2018-10-31 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22892#discussion_r229671459
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala ---
@@ -1063,21 +1067,19 @@ case class ShowCreateTableCommand(table: 
TableIdentifier) extends RunnableComman
 
 val dataSourceOptions = metadata.storage.properties.map {
   case (key, value) => s"${quoteIdentifier(key)} 
'${escapeSingleQuotedString(value)}'"
-} ++ metadata.storage.locationUri.flatMap { location =>
-  if (metadata.tableType == MANAGED) {
-// If it's a managed table, omit PATH option. Spark SQL always 
creates external table
-// when the table creation DDL contains the PATH option.
-None
-  } else {
-Some(s"path 
'${escapeSingleQuotedString(CatalogUtils.URIToString(location))}'")
-  }
 }
 
 if (dataSourceOptions.nonEmpty) {
   builder ++= "OPTIONS (\n"
   builder ++= dataSourceOptions.mkString("  ", ",\n  ", "\n")
   builder ++= ")\n"
 }
+
+if (metadata.tableType == EXTERNAL) {
--- End diff --

shall we also make it a method like `showTableComment`?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22898: [SPARK-25746][SQL][followup] do not add unnecessa...

2018-10-31 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22898#discussion_r229667653
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala
 ---
@@ -124,14 +124,9 @@ object ExpressionEncoder {
 s"`GetColumnByOrdinal`, but there are ${getColExprs.size}")
 
   val input = GetStructField(GetColumnByOrdinal(0, schema), index)
-  val newDeserializer = enc.objDeserializer.transformUp {
+  enc.objDeserializer.transformUp {
 case GetColumnByOrdinal(0, _) => input
   }
-  if (schema(index).nullable) {
-If(IsNull(input), Literal.create(null, newDeserializer.dataType), 
newDeserializer)
--- End diff --

good catch!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21860: [SPARK-24901][SQL]Merge the codegen of RegularHashMap an...

2018-10-31 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/21860
  
thanks, merging to master!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22788: [SPARK-25769][SQL]escape nested columns by backti...

2018-10-31 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22788#discussion_r229585323
  
--- Diff: 
sql/core/src/test/resources/sql-tests/results/columnresolution-negative.sql.out 
---
@@ -161,7 +161,7 @@ SELECT db1.t1.i1 FROM t1, mydb2.t1
 struct<>
 -- !query 18 output
 org.apache.spark.sql.AnalysisException
-cannot resolve '`db1.t1.i1`' given input columns: [mydb2.t1.i1, 
mydb2.t1.i1]; line 1 pos 7
+cannot resolve '`db1`.`t1`.`i1`' given input columns: [mydb2.t1.i1, 
mydb2.t1.i1]; line 1 pos 7
--- End diff --

> Is it okay to drop the backtick from the first identifier

AFAIK both `name` and `sql` are for display/error messages. I think dropping 
the backticks is fine if there is no ambiguity.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


