date:20180806

[GitHub] spark issue #21933: [SPARK-24917][CORE] make chunk size configurable

2018-08-06 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21933
  
**[Test build #94346 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94346/testReport)**
 for PR 21933 at commit 
[`e2961eb`](https://github.com/apache/spark/commit/e2961eb86f689de83770de5c3a73838512a62001).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21980: [SPARK-25010][SQL] Rand/Randn should produce different v...

2018-08-06 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21980
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21980: [SPARK-25010][SQL] Rand/Randn should produce different v...

2018-08-06 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21980
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94332/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21980: [SPARK-25010][SQL] Rand/Randn should produce different v...

2018-08-06 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21980
  
**[Test build #94332 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94332/testReport)**
 for PR 21980 at commit 
[`d4d8d0f`](https://github.com/apache/spark/commit/d4d8d0fd2597d52dd2da5b36da6f05a60d89d25e).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21721: [SPARK-24748][SS] Support for reporting custom me...

2018-08-06 Thread arunmahadevan

Github user arunmahadevan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21721#discussion_r208106031
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ProgressReporter.scala
 ---
@@ -196,6 +237,18 @@ trait ProgressReporter extends Logging {
 currentStatus = currentStatus.copy(isTriggerActive = false)
   }
 
+  /** Extract writer from the executed query plan. */
+  private def dataSourceWriter: Option[DataSourceWriter] = {
+if (lastExecution == null) return None
+lastExecution.executedPlan.collect {
+  case p if p.isInstanceOf[WriteToDataSourceV2Exec] =>
--- End diff --

yes, currently the progress is reported only for micro-batch mode. This 
should be supported for continuous mode as well when we start reporting 
progress, but needs some more work - 
https://issues.apache.org/jira/browse/SPARK-23887


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21933: [SPARK-24917][CORE] make chunk size configurable

2018-08-06 Thread kiszk

Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/21933
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21845: [SPARK-24886][INFRA] Fix the testing script to increase ...

2018-08-06 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21845
  
**[Test build #94345 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94345/testReport)**
 for PR 21845 at commit 
[`7afc5c5`](https://github.com/apache/spark/commit/7afc5c52fa31595b1eb458100d37fe92f62e31aa).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21845: [SPARK-24886][INFRA] Fix the testing script to increase ...

2018-08-06 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21845
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21845: [SPARK-24886][INFRA] Fix the testing script to increase ...

2018-08-06 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21845
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1892/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21898: [SPARK-24817][Core] Implement BarrierTaskContext.barrier...

2018-08-06 Thread HyukjinKwon

Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/21898
  
@rxin, here we seems indeed starting to hit the time limit now.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21980: [SPARK-25010][SQL] Rand/Randn should produce different v...

2018-08-06 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21980
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21980: [SPARK-25010][SQL] Rand/Randn should produce different v...

2018-08-06 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21980
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94331/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21087: [SPARK-23997][SQL] Configurable maximum number of...

2018-08-06 Thread kiszk

Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/21087#discussion_r208105302
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -1490,6 +1495,8 @@ class SQLConf extends Serializable with Logging {
 
   def bucketingEnabled: Boolean = getConf(SQLConf.BUCKETING_ENABLED)
 
+  def bucketingMaxBuckets: Long = getConf(SQLConf.BUCKETING_MAX_BUCKETS)
--- End diff --

Do we still need `Long` instead of `Int`?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21845: [SPARK-24886][INFRA] Fix the testing script to increase ...

2018-08-06 Thread HyukjinKwon

Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/21845
  
I am reopening this per 
https://github.com/apache/spark/pull/21898#issuecomment-410909703

cc @cloud-fan, @rxin and @shaneknapp 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21980: [SPARK-25010][SQL] Rand/Randn should produce different v...

2018-08-06 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21980
  
**[Test build #94331 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94331/testReport)**
 for PR 21980 at commit 
[`f60a238`](https://github.com/apache/spark/commit/f60a2384f335b1c95e81a0c232299af9bb426654).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21845: [SPARK-24886][INFRA] Fix the testing script to in...

2018-08-06 Thread HyukjinKwon

GitHub user HyukjinKwon reopened a pull request:

https://github.com/apache/spark/pull/21845

[SPARK-24886][INFRA] Fix the testing script to increase timeout for Jenkins 
build (from 300m to 330m)

## What changes were proposed in this pull request?

Currently, looks we hit the time limit time to time. Looks better 
increasing the time a bit.

For instance, please see https://github.com/apache/spark/pull/21822

For clarification, current Jenkins timeout is already 361m:

![screen shot 2018-07-23 at 1 28 57 
pm](https://user-images.githubusercontent.com/6477701/43058956-c0f94880-8e7c-11e8-86ba-20cdd92a1f14.png)

This PR just proposes to fix the test script to increase it correspondingly.

*This PR does not target to change the build configuration*

## How was this patch tested?

Jenkins tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HyukjinKwon/spark SPARK-24886

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21845.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21845


commit c57b74546b457d44e0c9d28d92a672fe5324d3ff
Author: hyukjinkwon 
Date:   2018-07-23T05:30:16Z

Fix the testing script to increase timeout for Jenkins build (from 300m to 
350m)

commit 7afc5c52fa31595b1eb458100d37fe92f62e31aa
Author: hyukjinkwon 
Date:   2018-07-23T05:38:09Z

Looks 330m good enough actually for now




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21898: [SPARK-24817][Core] Implement BarrierTaskContext.barrier...

2018-08-06 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21898
  
**[Test build #94344 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94344/testReport)**
 for PR 21898 at commit 
[`1f71e65`](https://github.com/apache/spark/commit/1f71e6583f9f9f270d07323f15c731717e13518d).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #22019: [WIP][SPARK-25040][SQL] Empty string for double and floa...

2018-08-06 Thread HyukjinKwon

Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22019
  
Hm.. wait let me take a closer look.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21898: [SPARK-24817][Core] Implement BarrierTaskContext.barrier...

2018-08-06 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21898
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1891/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21898: [SPARK-24817][Core] Implement BarrierTaskContext.barrier...

2018-08-06 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21898
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21860: [SPARK-24901][SQL]Merge the codegen of RegularHashMap an...

2018-08-06 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21860
  
**[Test build #94343 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94343/testReport)**
 for PR 21860 at commit 
[`9fab662`](https://github.com/apache/spark/commit/9fab662a6f99ee06be96d0c996ec1209b25a3f84).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21898: [SPARK-24817][Core] Implement BarrierTaskContext.barrier...

2018-08-06 Thread mengxr

Github user mengxr commented on the issue:

https://github.com/apache/spark/pull/21898
  
test this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21087: [SPARK-23997][SQL] Configurable maximum number of...

2018-08-06 Thread kiszk

Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/21087#discussion_r208103944
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/sources/BucketedWriteSuite.scala 
---
@@ -48,16 +49,40 @@ abstract class BucketedWriteSuite extends QueryTest 
with SQLTestUtils {
 intercept[AnalysisException](df.write.bucketBy(2, 
"k").saveAsTable("tt"))
   }
 
-  test("numBuckets be greater than 0 but less than 10") {
+  test("numBuckets be greater than 0 but less than default 
bucketing.maxBuckets (10)") {
 val df = Seq(1 -> "a", 2 -> "b").toDF("i", "j")
 
-Seq(-1, 0, 10).foreach(numBuckets => {
-  val e = intercept[AnalysisException](df.write.bucketBy(numBuckets, 
"i").saveAsTable("tt"))
-  assert(
-e.getMessage.contains("Number of buckets should be greater than 0 
but less than 10"))
+Seq(-1, 0, 11).foreach(numBuckets => {
--- End diff --

nit: Only two parts are necessary to be updated for ease of tracking 
updates. Other changes look unnecessary.
`10` -> `11`
`less than 10` -> `less than`



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21991: [SPARK-25018] [Infra] Use `Co-authored-by` and `Signed-o...

2018-08-06 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21991
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94329/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21991: [SPARK-25018] [Infra] Use `Co-authored-by` and `Signed-o...

2018-08-06 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21991
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21991: [SPARK-25018] [Infra] Use `Co-authored-by` and `Signed-o...

2018-08-06 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21991
  
**[Test build #94329 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94329/testReport)**
 for PR 21991 at commit 
[`272d8fd`](https://github.com/apache/spark/commit/272d8fd4c6a46164069e2e3a892f016e9664cf5f).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #22019: [SPARK-25040][SQL] Empty string for double and float typ...

2018-08-06 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22019
  
**[Test build #94342 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94342/testReport)**
 for PR 22019 at commit 
[`ef57fdd`](https://github.com/apache/spark/commit/ef57fdd5b0a6f7f0b6343c91c6983d20bc67fb5b).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #22019: [SPARK-25040][SQL] Empty string for double and float typ...

2018-08-06 Thread HyukjinKwon

Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22019
  
Looks few other types could potentially have this issue too. Let me fix 
them all here while I am here.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #22019: [SPARK-25040][SQL] Empty string for double and float typ...

2018-08-06 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22019
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #22019: [SPARK-25040][SQL] Empty string for double and float typ...

2018-08-06 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22019
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1890/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #22019: [SPARK-25040][SQL] Empty string for double and fl...

2018-08-06 Thread HyukjinKwon

GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/spark/pull/22019

[SPARK-25040][SQL] Empty string for double and float types should be nulls 
in JSON

## What changes were proposed in this pull request?

This PR proposes to treat empty strings for double and float types as 
`null` consistently. Looks we mistakenly missed this corner case, which I guess 
is not that serious since this looks happened betwen 1.x and 2.x, and pretty 
corner case.

For an easy reproducer, in case of double, the code below raises an error:

```scala
spark.read.option("mode", "FAILFAST").json(Seq("""{"a":"", "b": ""}""", 
"""{"a": 1.1, "b": 1.1}""").toDS).show()
```

```scala
Caused by: java.lang.RuntimeException: Cannot parse  as double.
  at 
org.apache.spark.sql.catalyst.json.JacksonParser$$anonfun$makeConverter$7$$anonfun$apply$10.applyOrElse(JacksonParser.scala:163)
  at 
org.apache.spark.sql.catalyst.json.JacksonParser$$anonfun$makeConverter$7$$anonfun$apply$10.applyOrElse(JacksonParser.scala:152)
  at 
org.apache.spark.sql.catalyst.json.JacksonParser.org$apache$spark$sql$catalyst$json$JacksonParser$$parseJsonToken(JacksonParser.scala:277)
  at 
org.apache.spark.sql.catalyst.json.JacksonParser$$anonfun$makeConverter$7.apply(JacksonParser.scala:152)
  at 
org.apache.spark.sql.catalyst.json.JacksonParser$$anonfun$makeConverter$7.apply(JacksonParser.scala:152)
  at 
org.apache.spark.sql.catalyst.json.JacksonParser.org$apache$spark$sql$catalyst$json$JacksonParser$$convertObject(JacksonParser.scala:312)
  at 
org.apache.spark.sql.catalyst.json.JacksonParser$$anonfun$makeStructRootConverter$1$$anonfun$apply$2.applyOrElse(JacksonParser.scala:71)
  at 
org.apache.spark.sql.catalyst.json.JacksonParser$$anonfun$makeStructRootConverter$1$$anonfun$apply$2.applyOrElse(JacksonParser.scala:70)
  at 
org.apache.spark.sql.catalyst.json.JacksonParser.org$apache$spark$sql$catalyst$json$JacksonParser$$parseJsonToken(JacksonParser.scala:277)
  at 
org.apache.spark.sql.catalyst.json.JacksonParser$$anonfun$makeStructRootConverter$1.apply(JacksonParser.scala:70)
  at 
org.apache.spark.sql.catalyst.json.JacksonParser$$anonfun$makeStructRootConverter$1.apply(JacksonParser.scala:70)
  at 
org.apache.spark.sql.catalyst.json.JacksonParser$$anonfun$parse$2.apply(JacksonParser.scala:368)
  at 
org.apache.spark.sql.catalyst.json.JacksonParser$$anonfun$parse$2.apply(JacksonParser.scala:363)
  at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2491)
  at 
org.apache.spark.sql.catalyst.json.JacksonParser.parse(JacksonParser.scala:363)
  at 
org.apache.spark.sql.DataFrameReader$$anonfun$5$$anonfun$6.apply(DataFrameReader.scala:450)
  at 
org.apache.spark.sql.DataFrameReader$$anonfun$5$$anonfun$6.apply(DataFrameReader.scala:450)
  at 
org.apache.spark.sql.execution.datasources.FailureSafeParser.parse(FailureSafeParser.scala:61)
  ... 24 more
```

Unlike other types:

```scala
spark.read.option("mode", "FAILFAST").json(Seq("""{"a":"", "b": ""}""", 
"""{"a": 1, "b": 1}""").toDS).show()
```

```
+++
|   a|   b|
+++
|null|null|
|   1|   1|
+++
```

## How was this patch tested?

Unit tests were added and manually tested.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HyukjinKwon/spark double-float-empty

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22019.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22019


commit ef57fdd5b0a6f7f0b6343c91c6983d20bc67fb5b
Author: hyukjinkwon 
Date:   2018-08-07T05:23:43Z

Empty string for double and float types should be nulls in JSON




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #22019: [SPARK-25040][SQL] Empty string for double and float typ...

2018-08-06 Thread HyukjinKwon

Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22019
  
cc @cloud-fan, @viirya and @fuqiliang


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser behavior ...

2018-08-06 Thread HyukjinKwon

Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20666
  
That's not related to this change. The issue itself seems to be a behaviour 
change between 1.6 and 2.x for treating empty string as null or not in double 
and float, which is rather a corner case and which looks, yea, an issue. Let me 
try to fix it while I'm here.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21919: [SPARK-24933][SS] Report numOutputRows in SinkPro...

2018-08-06 Thread cloud-fan

Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21919#discussion_r208100226
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2.scala
 ---
@@ -46,6 +46,9 @@ case class WriteToDataSourceV2(writer: DataSourceWriter, 
query: LogicalPlan) ext
  * The physical plan for writing data into data source v2.
  */
 case class WriteToDataSourceV2Exec(writer: DataSourceWriter, query: 
SparkPlan) extends SparkPlan {
--- End diff --

shall we create a new plan for writing micro-batch sink? If we are going to 
add more streaming related functionalities to this plan, we'd better separate 
the batch and streaming writing plans.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21305: [SPARK-24251][SQL] Add AppendData logical plan.

2018-08-06 Thread cloud-fan

Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/21305
  
LGTM, pending jenkins


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21898: [SPARK-24817][Core] Implement BarrierTaskContext.barrier...

2018-08-06 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21898
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1889/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21898: [SPARK-24817][Core] Implement BarrierTaskContext.barrier...

2018-08-06 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21898
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21898: [SPARK-24817][Core] Implement BarrierTaskContext.barrier...

2018-08-06 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21898
  
**[Test build #94341 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94341/testReport)**
 for PR 21898 at commit 
[`1f71e65`](https://github.com/apache/spark/commit/1f71e6583f9f9f270d07323f15c731717e13518d).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21898: [SPARK-24817][Core] Implement BarrierTaskContext.barrier...

2018-08-06 Thread cloud-fan

Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/21898
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #16898: [SPARK-19563][SQL] avoid unnecessary sort in File...

2018-08-06 Thread cloud-fan

Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16898#discussion_r208098556
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala
 ---
@@ -119,23 +130,45 @@ object FileFormatWriter extends Logging {
   uuid = UUID.randomUUID().toString,
   serializableHadoopConf = new 
SerializableConfiguration(job.getConfiguration),
   outputWriterFactory = outputWriterFactory,
-  allColumns = queryExecution.logical.output,
-  partitionColumns = partitionColumns,
+  allColumns = allColumns,
   dataColumns = dataColumns,
-  bucketSpec = bucketSpec,
+  partitionColumns = partitionColumns,
+  bucketIdExpression = bucketIdExpression,
   path = outputSpec.outputPath,
   customPartitionLocations = outputSpec.customPartitionLocations,
   maxRecordsPerFile = options.get("maxRecordsPerFile").map(_.toLong)
 .getOrElse(sparkSession.sessionState.conf.maxRecordsPerFile)
 )
 
+// We should first sort by partition columns, then bucket id, and 
finally sorting columns.
+val requiredOrdering = partitionColumns ++ bucketIdExpression ++ 
sortColumns
+// the sort order doesn't matter
+val actualOrdering = 
queryExecution.executedPlan.outputOrdering.map(_.child)
--- End diff --

That would be great, but may need some refactoring.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #22009: [SPARK-24882][SQL] improve data source v2 API

2018-08-06 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22009
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #22009: [SPARK-24882][SQL] improve data source v2 API

2018-08-06 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22009
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1888/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #22009: [SPARK-24882][SQL] improve data source v2 API

2018-08-06 Thread cloud-fan

Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22009#discussion_r208098178
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/sources/memoryV2.scala
 ---
@@ -132,35 +134,15 @@ class MemoryV2CustomMetrics(sink: MemorySinkV2) 
extends CustomMetrics {
   override def json(): String = Serialization.write(Map("numRows" -> 
sink.numRows))
 }
 
-class MemoryWriter(sink: MemorySinkV2, batchId: Long, outputMode: 
OutputMode, schema: StructType)
--- End diff --

this is actually a batch writer not micro-batch, and is only used in the 
test. For writer API, micro-batch and continuous share the same interface, so 
we only need one streaming write implementation.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #22009: [SPARK-24882][SQL] improve data source v2 API

2018-08-06 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22009
  
**[Test build #94340 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94340/testReport)**
 for PR 22009 at commit 
[`c224999`](https://github.com/apache/spark/commit/c22499964ac759670c3629c690f77018bc79a7c1).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #22009: [SPARK-24882][SQL] improve data source v2 API

2018-08-06 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22009
  
**[Test build #94338 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94338/testReport)**
 for PR 22009 at commit 
[`29b4f33`](https://github.com/apache/spark/commit/29b4f33567ccf6335dfd7dbfee620008c7d50b81).
 * This patch **fails to build**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #22009: [SPARK-24882][SQL] improve data source v2 API

2018-08-06 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22009
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94338/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #22009: [SPARK-24882][SQL] improve data source v2 API

2018-08-06 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22009
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21305: [SPARK-24251][SQL] Add AppendData logical plan.

2018-08-06 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21305
  
**[Test build #94339 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94339/testReport)**
 for PR 21305 at commit 
[`e81790d`](https://github.com/apache/spark/commit/e81790d072ed66f1126d5918bd1a39222a9f5cfa).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21305: [SPARK-24251][SQL] Add AppendData logical plan.

2018-08-06 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21305
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1887/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21305: [SPARK-24251][SQL] Add AppendData logical plan.

2018-08-06 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21305
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21305: [SPARK-24251][SQL] Add AppendData logical plan.

2018-08-06 Thread rdblue

Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/21305
  
@cloud-fan, I've rebased and updated with the requested change to disallow 
missing columns, even if they're optional. Thanks for reviewing!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #22009: [SPARK-24882][SQL] improve data source v2 API

2018-08-06 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22009
  
**[Test build #94338 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94338/testReport)**
 for PR 22009 at commit 
[`29b4f33`](https://github.com/apache/spark/commit/29b4f33567ccf6335dfd7dbfee620008c7d50b81).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #22009: [SPARK-24882][SQL] improve data source v2 API

2018-08-06 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22009
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1886/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #22009: [SPARK-24882][SQL] improve data source v2 API

2018-08-06 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22009
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #22009: [SPARK-24882][SQL] improve data source v2 API

2018-08-06 Thread cloud-fan

Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22009#discussion_r208095752
  
--- Diff: 
sql/core/src/test/java/test/org/apache/spark/sql/sources/v2/JavaBatchDataSourceV2.java
 ---
@@ -1,114 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements.  See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License.  You may obtain a copy of the License at
- *
- *http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-package test.org.apache.spark.sql.sources.v2;
-
-import java.io.IOException;
-import java.util.List;
-
-import org.apache.spark.sql.execution.vectorized.OnHeapColumnVector;
-import org.apache.spark.sql.sources.v2.DataSourceOptions;
-import org.apache.spark.sql.sources.v2.DataSourceV2;
-import org.apache.spark.sql.sources.v2.ReadSupport;
-import org.apache.spark.sql.sources.v2.reader.*;
-import org.apache.spark.sql.types.DataTypes;
-import org.apache.spark.sql.types.StructType;
-import org.apache.spark.sql.vectorized.ColumnVector;
-import org.apache.spark.sql.vectorized.ColumnarBatch;
-
-
-public class JavaBatchDataSourceV2 implements DataSourceV2, ReadSupport {
--- End diff --

This is renamed to `JavaColumnarDataSourceV2`, to avoid confusion about 
batch vs streaming.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #16898: [SPARK-19563][SQL] avoid unnecessary sort in File...

2018-08-06 Thread leachbj

Github user leachbj commented on a diff in the pull request:

https://github.com/apache/spark/pull/16898#discussion_r208094538
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala
 ---
@@ -119,23 +130,45 @@ object FileFormatWriter extends Logging {
   uuid = UUID.randomUUID().toString,
   serializableHadoopConf = new 
SerializableConfiguration(job.getConfiguration),
   outputWriterFactory = outputWriterFactory,
-  allColumns = queryExecution.logical.output,
-  partitionColumns = partitionColumns,
+  allColumns = allColumns,
   dataColumns = dataColumns,
-  bucketSpec = bucketSpec,
+  partitionColumns = partitionColumns,
+  bucketIdExpression = bucketIdExpression,
   path = outputSpec.outputPath,
   customPartitionLocations = outputSpec.customPartitionLocations,
   maxRecordsPerFile = options.get("maxRecordsPerFile").map(_.toLong)
 .getOrElse(sparkSession.sessionState.conf.maxRecordsPerFile)
 )
 
+// We should first sort by partition columns, then bucket id, and 
finally sorting columns.
+val requiredOrdering = partitionColumns ++ bucketIdExpression ++ 
sortColumns
+// the sort order doesn't matter
+val actualOrdering = 
queryExecution.executedPlan.outputOrdering.map(_.child)
--- End diff --

@cloud-fan would it be possible to use the logical plan rather than the 
executedPlan?  If the optimizer decides the data is already sorted according 
according to the logical plan the executedPlan won't include the fields.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #22011: [WIP][SPARK-24822][PySpark] Python support for ba...

2018-08-06 Thread HyukjinKwon

Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22011#discussion_r208093660
  
--- Diff: python/pyspark/rdd.py ---
@@ -2429,6 +2441,29 @@ def _wrap_function(sc, func, deserializer, 
serializer, profiler=None):
   sc.pythonVer, broadcast_vars, 
sc._javaAccumulator)
 
 
+class RDDBarrier(object):
+
+"""
+.. note:: Experimental
+
+An RDDBarrier turns an RDD into a barrier RDD, which forces Spark to 
launch tasks of the stage
+contains this RDD together.
+"""
+
+def __init__(self, rdd):
+self.rdd = rdd
+self._jrdd = rdd._jrdd
+
+def mapPartitions(self, f, preservesPartitioning=False):
+"""
+Return a new RDD by applying a function to each partition of this 
RDD.
+"""
--- End diff --

shall we match the documentation, or why is it different?

FWIW, for coding block, just `` `blabla` `` should be good enough. Nicer if 
linked properly by like `` :class:`ClassName` ``.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #22009: [SPARK-24882][SQL] improve data source v2 API

2018-08-06 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22009
  
**[Test build #94337 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94337/testReport)**
 for PR 22009 at commit 
[`2fc3b05`](https://github.com/apache/spark/commit/2fc3b05dde2c405e2af72b815700780f114767c1).
 * This patch **fails to build**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #22009: [SPARK-24882][SQL] improve data source v2 API

2018-08-06 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22009
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94337/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #22009: [SPARK-24882][SQL] improve data source v2 API

2018-08-06 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22009
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #22017: [SPARK-23938][SQL] Add map_zip_with function

2018-08-06 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22017
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #22017: [SPARK-23938][SQL] Add map_zip_with function

2018-08-06 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22017
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94325/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #22011: [WIP][SPARK-24822][PySpark] Python support for ba...

2018-08-06 Thread HyukjinKwon

Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22011#discussion_r208093240
  
--- Diff: python/pyspark/rdd.py ---
@@ -2429,6 +2441,29 @@ def _wrap_function(sc, func, deserializer, 
serializer, profiler=None):
   sc.pythonVer, broadcast_vars, 
sc._javaAccumulator)
 
 
+class RDDBarrier(object):
+
+"""
+.. note:: Experimental
+
+An RDDBarrier turns an RDD into a barrier RDD, which forces Spark to 
launch tasks of the stage
--- End diff --

nit: `RDDBarrier` -> `RDD barrier`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #22017: [SPARK-23938][SQL] Add map_zip_with function

2018-08-06 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22017
  
**[Test build #94325 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94325/testReport)**
 for PR 22017 at commit 
[`ef56011`](https://github.com/apache/spark/commit/ef56011f03d8bae4634e5d3108e4d6502482383c).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class ArrayDataMerger(elementType: DataType) `
  * `case class ArrayUnion(left: Expression, right: Expression) extends 
ArraySetLike`
  * `abstract class GetMapValueUtil extends BinaryExpression with 
ImplicitCastInputTypes `
  * `case class MapZipWith(left: Expression, right: Expression, function: 
Expression)`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #22009: [SPARK-24882][SQL] improve data source v2 API

2018-08-06 Thread rdblue

Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/22009
  
There must not be one. I thought you'd already started a PR, my mistake.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #22011: [WIP][SPARK-24822][PySpark] Python support for ba...

2018-08-06 Thread HyukjinKwon

Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22011#discussion_r208092650
  
--- Diff: python/pyspark/rdd.py ---
@@ -2429,6 +2441,29 @@ def _wrap_function(sc, func, deserializer, 
serializer, profiler=None):
   sc.pythonVer, broadcast_vars, 
sc._javaAccumulator)
 
 
+class RDDBarrier(object):
+
+"""
+.. note:: Experimental
+
+An RDDBarrier turns an RDD into a barrier RDD, which forces Spark to 
launch tasks of the stage
+contains this RDD together.
--- End diff --

ditto let's add `.. versionadded:: 2.4.0` at the end.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #22009: [SPARK-24882][SQL] improve data source v2 API

2018-08-06 Thread cloud-fan

Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22009
  
@rdblue can you point me to the other PR? This is the only PR I send out 
for data source v2 API improvement. I'd appreciate your time to review it, 
thanks!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21991: [SPARK-25018] [Infra] Use `Co-authored-by` and `Signed-o...

2018-08-06 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21991
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94323/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21991: [SPARK-25018] [Infra] Use `Co-authored-by` and `Signed-o...

2018-08-06 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21991
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #22009: [SPARK-24882][SQL] improve data source v2 API

2018-08-06 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22009
  
**[Test build #94337 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94337/testReport)**
 for PR 22009 at commit 
[`2fc3b05`](https://github.com/apache/spark/commit/2fc3b05dde2c405e2af72b815700780f114767c1).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #22009: [SPARK-24882][SQL] improve data source v2 API

2018-08-06 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22009
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #22009: [SPARK-24882][SQL] improve data source v2 API

2018-08-06 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22009
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1885/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #22009: [SPARK-24882][SQL] improve data source v2 API

2018-08-06 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22009
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #22009: [SPARK-24882][SQL] improve data source v2 API

2018-08-06 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22009
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94336/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #22009: [SPARK-24882][SQL] improve data source v2 API

2018-08-06 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22009
  
**[Test build #94336 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94336/testReport)**
 for PR 22009 at commit 
[`cab6d28`](https://github.com/apache/spark/commit/cab6d2828dacaca6e62d3409c684d18a1fc861f2).
 * This patch **fails to build**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21991: [SPARK-25018] [Infra] Use `Co-authored-by` and `Signed-o...

2018-08-06 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21991
  
**[Test build #94323 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94323/testReport)**
 for PR 21991 at commit 
[`b3347f8`](https://github.com/apache/spark/commit/b3347f855a854740dc2d5082ea42fd8c723c3484).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #22011: [WIP][SPARK-24822][PySpark] Python support for ba...

2018-08-06 Thread HyukjinKwon

Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22011#discussion_r208092333
  
--- Diff: python/pyspark/rdd.py ---
@@ -2406,6 +2406,18 @@ def toLocalIterator(self):
 sock_info = 
self.ctx._jvm.PythonRDD.toLocalIteratorAndServe(self._jrdd.rdd())
 return _load_from_socket(sock_info, self._jrdd_deserializer)
 
+def barrier(self):
--- End diff --

I don't know why we didn't mark the version so far here but we really 
should `.. versionadded:: 2.4.0` here or 

```
@since(2.4)
def barrier(self):
...
```


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #22009: [SPARK-24882][SQL] improve data source v2 API

2018-08-06 Thread rdblue

Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/22009
  
Does this replace the other PR? I haven't looked at that one yet. If this 
is ready to review and follows the doc, I can review it.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #22014: [SPARK-25036][SQL] avoid match may not be exhaustive in ...

2018-08-06 Thread HyukjinKwon

Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22014
  
LGTM except those rather nits.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #22014: [SPARK-25036][SQL] avoid match may not be exhaust...

2018-08-06 Thread HyukjinKwon

Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22014#discussion_r208091824
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala 
---
@@ -709,6 +709,7 @@ object ScalaReflection extends ScalaReflection {
   def attributesFor[T: TypeTag]: Seq[Attribute] = schemaFor[T] match {
 case Schema(s: StructType, _) =>
   s.toAttributes
+case _ => throw new RuntimeException(s"$schemaFor is not supported at 
attributesFor()")
--- End diff --

How about this:

```scala
case other =>
  throw new UnsupportedOperationException(s"Attributes for type $other 
is not supported")
```


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #22014: [SPARK-25036][SQL] avoid match may not be exhaust...

2018-08-06 Thread HyukjinKwon

Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22014#discussion_r208091445
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproxCountDistinctForIntervals.scala
 ---
@@ -67,6 +67,7 @@ case class ApproxCountDistinctForIntervals(
 (endpointsExpression.dataType, endpointsExpression.eval()) match {
   case (ArrayType(elementType, _), arrayData: ArrayData) =>
 arrayData.toObjectArray(elementType).map(_.toString.toDouble)
+  case _ => throw new RuntimeException("not found at endpoints")
--- End diff --

Can we do this like:

```scala
val endpointsType = endpointsExpression.dataType.asInstanceOf[ArrayType]
val endpoints = endpointsExpression.eval().asInstanceOf[ArrayData]

endpoints.toObjectArray(endpointsType.elementType).map(_.toString.toDouble)
```



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #22014: [SPARK-25036][SQL] avoid match may not be exhaust...

2018-08-06 Thread HyukjinKwon

Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22014#discussion_r208090085
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
 ---
@@ -471,6 +471,7 @@ class CodegenContext {
   case NewFunctionSpec(functionName, None, None) => functionName
   case NewFunctionSpec(functionName, Some(_), 
Some(innerClassInstance)) =>
 innerClassInstance + "." + functionName
+  case _ => null // nothing to do since addNewFunctionInteral() must 
return one of them
--- End diff --

Shall we throw an `IllegalArgumentException`? 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #22014: [SPARK-25036][SQL] avoid match may not be exhaust...

2018-08-06 Thread HyukjinKwon

Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22014#discussion_r208089613
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/ValueInterval.scala
 ---
@@ -86,6 +87,7 @@ object ValueInterval {
 val newMax = if (n1.max <= n2.max) n1.max else n2.max
 (Some(EstimationUtils.fromDouble(newMin, dt)),
   Some(EstimationUtils.fromDouble(newMax, dt)))
+  case _ => throw new RuntimeException(s"Not supported pair: $r1, $r2 
at intersect()")
--- End diff --

Shall we do `UnsupportedOperationException`?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21977: SPARK-25004: Add spark.executor.pyspark.memory li...

2018-08-06 Thread rdblue

Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/21977#discussion_r208091782
  
--- Diff: 
core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala ---
@@ -60,14 +61,26 @@ private[spark] object PythonEvalType {
  */
 private[spark] abstract class BasePythonRunner[IN, OUT](
 funcs: Seq[ChainedPythonFunctions],
-bufferSize: Int,
-reuseWorker: Boolean,
 evalType: Int,
-argOffsets: Array[Array[Int]])
+argOffsets: Array[Array[Int]],
+conf: SparkConf)
   extends Logging {
 
   require(funcs.length == argOffsets.length, "argOffsets should have the 
same length as funcs")
 
+  private val bufferSize = conf.getInt("spark.buffer.size", 65536)
+  private val reuseWorker = conf.getBoolean("spark.python.worker.reuse", 
true)
+  private val memoryMb = {
+val allocation = conf.get(PYSPARK_EXECUTOR_MEMORY)
+if (reuseWorker) {
--- End diff --

No, I'm not sure where that is. Is it on the python side? If you can point 
me to it, I'll have a closer look.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #22009: [SPARK-24882][SQL] improve data source v2 API

2018-08-06 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22009
  
**[Test build #94336 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94336/testReport)**
 for PR 22009 at commit 
[`cab6d28`](https://github.com/apache/spark/commit/cab6d2828dacaca6e62d3409c684d18a1fc861f2).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #22009: [SPARK-24882][SQL] improve data source v2 API

2018-08-06 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22009
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1884/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #22009: [SPARK-24882][SQL] improve data source v2 API

2018-08-06 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22009
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21898: [SPARK-24817][Core] Implement BarrierTaskContext.barrier...

2018-08-06 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21898
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21898: [SPARK-24817][Core] Implement BarrierTaskContext.barrier...

2018-08-06 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21898
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94326/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21898: [SPARK-24817][Core] Implement BarrierTaskContext.barrier...

2018-08-06 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21898
  
**[Test build #94326 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94326/testReport)**
 for PR 21898 at commit 
[`1f71e65`](https://github.com/apache/spark/commit/1f71e6583f9f9f270d07323f15c731717e13518d).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21991: [SPARK-25018] [Infra] Use `Co-authored-by` and `Signed-o...

2018-08-06 Thread viirya

Github user viirya commented on the issue:

https://github.com/apache/spark/pull/21991
  
The failed test is `FlatMapGroupsWithStateSuite.flatMapGroupsWithState`. I 
saw it fails some times occasionally. I think it should not be related to this 
change. @HyukjinKwon @dbtsai 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21305: [SPARK-24251][SQL] Add AppendData logical plan.

2018-08-06 Thread rdblue

Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/21305#discussion_r208090428
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala ---
@@ -336,4 +337,124 @@ object DataType {
   case (fromDataType, toDataType) => fromDataType == toDataType
 }
   }
+
+  private val SparkGeneratedName = """col\d+""".r
+  private def isSparkGeneratedName(name: String): Boolean = name match {
+case SparkGeneratedName(_*) => true
+case _ => false
+  }
+
+  /**
+   * Returns true if the write data type can be read using the read data 
type.
+   *
+   * The write type is compatible with the read type if:
+   * - Both types are arrays, the array element types are compatible, and 
element nullability is
+   *   compatible (read allows nulls or write does not contain nulls).
+   * - Both types are maps and the map key and value types are compatible, 
and value nullability
+   *   is compatible  (read allows nulls or write does not contain nulls).
+   * - Both types are structs and each field in the read struct is present 
in the write struct and
+   *   compatible (including nullability), or is nullable if the write 
struct does not contain the
+   *   field. Write-side structs are not compatible if they contain fields 
that are not present in
+   *   the read-side struct.
+   * - Both types are atomic and the write type can be safely cast to the 
read type.
+   *
+   * Extra fields in write-side structs are not allowed to avoid 
accidentally writing data that
+   * the read schema will not read, and to ensure map key equality is not 
changed when data is read.
+   *
+   * @param write a write-side data type to validate against the read type
+   * @param read a read-side data type
+   * @return true if data written with the write type can be read using 
the read type
+   */
+  def canWrite(
+  write: DataType,
+  read: DataType,
+  resolver: Resolver,
+  context: String,
+  addError: String => Unit = (_: String) => {}): Boolean = {
+(write, read) match {
+  case (wArr: ArrayType, rArr: ArrayType) =>
+// run compatibility check first to produce all error messages
+val typesCompatible =
+  canWrite(wArr.elementType, rArr.elementType, resolver, context + 
".element", addError)
+
+if (wArr.containsNull && !rArr.containsNull) {
+  addError(s"Cannot write nullable elements to array of non-nulls: 
'$context'")
+  false
+} else {
+  typesCompatible
+}
+
+  case (wMap: MapType, rMap: MapType) =>
+// map keys cannot include data fields not in the read schema 
without changing equality when
+// read. map keys can be missing fields as long as they are 
nullable in the read schema.
+
+// run compatibility check first to produce all error messages
+val keyCompatible =
+  canWrite(wMap.keyType, rMap.keyType, resolver, context + ".key", 
addError)
+val valueCompatible =
+  canWrite(wMap.valueType, rMap.valueType, resolver, context + 
".value", addError)
+val typesCompatible = keyCompatible && valueCompatible
+
+if (wMap.valueContainsNull && !rMap.valueContainsNull) {
+  addError(s"Cannot write nullable values to map of non-nulls: 
'$context'")
+  false
+} else {
+  typesCompatible
+}
+
+  case (StructType(writeFields), StructType(readFields)) =>
+var fieldCompatible = true
+readFields.zip(writeFields).foreach {
+  case (rField, wField) =>
+val namesMatch = resolver(wField.name, rField.name) || 
isSparkGeneratedName(wField.name)
+val fieldContext = s"$context.${rField.name}"
+val typesCompatible =
+  canWrite(wField.dataType, rField.dataType, resolver, 
fieldContext, addError)
+
+if (!namesMatch) {
+  addError(s"Struct '$context' field name does not match (may 
be out of order): " +
+  s"expected '${rField.name}', found '${wField.name}'")
+  fieldCompatible = false
+} else if (!rField.nullable && wField.nullable) {
+  addError(s"Cannot write nullable values to non-null field: 
'$fieldContext'")
+  fieldCompatible = false
+} else if (!typesCompatible) {
+  // errors are added in the recursive call to canWrite above
+  fieldCompatible = false
+}
+}
+
+if (readFields.size > writeFields.size) {
+  val missingFieldsStr =

[GitHub] spark pull request #21305: [SPARK-24251][SQL] Add AppendData logical plan.

2018-08-06 Thread rdblue

Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/21305#discussion_r208090280
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/types/DataTypeWriteCompatibilitySuite.scala
 ---
@@ -0,0 +1,395 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.types
+
+import scala.collection.mutable
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.sql.catalyst.analysis
+import org.apache.spark.sql.catalyst.expressions.Cast
+
+class DataTypeWriteCompatibilitySuite extends SparkFunSuite {
--- End diff --

I'm planning on adding this, but it would be great to get this in and I'll 
add the tests next. It would be great to get this in to no longer keep rebasing 
it! Thanks!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser behavior ...

2018-08-06 Thread fuqiliang

Github user fuqiliang commented on the issue:

https://github.com/apache/spark/pull/20666
  
for specify, the json file (Sanity4.json) is
`{"a":"a1","int":1,"other":4.4}
{"a":"a2","int":"","other":""}`

code ï¼

> val config = new SparkConf().setMaster("local[5]").setAppName("test")
> val sc = SparkContext.getOrCreate(config)
> val sql = new SQLContext(sc)
> 
>  val file_path = 
this.getClass.getClassLoader.getResource("Sanity4.json").getFile
>  val df = sql.read.schema(null).json(file_path)
>  df.show(30)



then in spark 1.6, result is
+---++-+
|  a| int|other|
+---++-+
| a1|   1|  4.4|
| a2|null| null|
+---++-+

root
 |-- a: string (nullable = true)
 |-- int: long (nullable = true)
 |-- other: double (nullable = true)

but in spark 2.2, result is
+++-+
|   a| int|other|
+++-+
|  a1|   1|  4.4|
|null|null| null|
+++-+

root
 |-- a: string (nullable = true)
 |-- int: long (nullable = true)
 |-- other: double (nullable = true)




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #17185: [SPARK-19602][SQL] Support column resolution of f...

2018-08-06 Thread skambha

Github user skambha commented on a diff in the pull request:

https://github.com/apache/spark/pull/17185#discussion_r208089990
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala
 ---
@@ -169,25 +181,50 @@ package object expressions  {
 })
   }
 
-  // Find matches for the given name assuming that the 1st part is a 
qualifier (i.e. table name,
-  // alias, or subquery alias) and the 2nd part is the actual name. 
This returns a tuple of
+  // Find matches for the given name assuming that the 1st two parts 
are qualifier
+  // (i.e. database name and table name) and the 3rd part is the 
actual column name.
+  //
+  // For example, consider an example where "db1" is the database 
name, "a" is the table name
+  // and "b" is the column name and "c" is the struct field name.
+  // If the name parts is db1.a.b.c, then Attribute will match
+  // Attribute(b, qualifier("db1,"a")) and List("c") will be the 
second element
+  var matches: (Seq[Attribute], Seq[String]) = nameParts match {
+case dbPart +: tblPart +: name +: nestedFields =>
+  val key = (dbPart.toLowerCase(Locale.ROOT),
+tblPart.toLowerCase(Locale.ROOT), 
name.toLowerCase(Locale.ROOT))
+  val attributes = collectMatches(name, 
qualified3Part.get(key)).filter {
+a => (resolver(dbPart, a.qualifier.head) && resolver(tblPart, 
a.qualifier.last))
+  }
+  (attributes, nestedFields)
+case all =>
+  (Seq.empty, Seq.empty)
+  }
+
+  // If there are no matches, then find matches for the given name 
assuming that
+  // the 1st part is a qualifier (i.e. table name, alias, or subquery 
alias) and the
+  // 2nd part is the actual name. This returns a tuple of
   // matched attributes and a list of parts that are to be resolved.
   //
   // For example, consider an example where "a" is the table name, "b" 
is the column name,
   // and "c" is the struct field name, i.e. "a.b.c". In this case, 
Attribute will be "a.b",
   // and the second element will be List("c").
-  val matches = nameParts match {
-case qualifier +: name +: nestedFields =>
-  val key = (qualifier.toLowerCase(Locale.ROOT), 
name.toLowerCase(Locale.ROOT))
-  val attributes = collectMatches(name, qualified.get(key)).filter 
{ a =>
-resolver(qualifier, a.qualifier.get)
+  matches = matches match {
--- End diff --

done.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #17185: [SPARK-19602][SQL] Support column resolution of fully qu...

2018-08-06 Thread skambha

Github user skambha commented on the issue:

https://github.com/apache/spark/pull/17185
  
Thanks for the review.  I have addressed your comments and pushed the 
changes.  
@cloud-fan, Please take a look.  


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #17185: [SPARK-19602][SQL] Support column resolution of fully qu...

2018-08-06 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17185
  
**[Test build #94335 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94335/testReport)**
 for PR 17185 at commit 
[`5f7e5d7`](https://github.com/apache/spark/commit/5f7e5d7bddca593d72818b07d71f678bd0a1982d).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21991: [SPARK-25018] [Infra] Use `Co-authored-by` and `Signed-o...

2018-08-06 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21991
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94324/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #22018: [SPARK-25038][SQL] Accelerate Spark Plan generation when...

2018-08-06 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22018
  
Can one of the admins verify this patch?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21721: [SPARK-24748][SS] Support for reporting custom me...

2018-08-06 Thread cloud-fan

Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21721#discussion_r208089043
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ProgressReporter.scala
 ---
@@ -196,6 +237,18 @@ trait ProgressReporter extends Logging {
 currentStatus = currentStatus.copy(isTriggerActive = false)
   }
 
+  /** Extract writer from the executed query plan. */
+  private def dataSourceWriter: Option[DataSourceWriter] = {
+if (lastExecution == null) return None
+lastExecution.executedPlan.collect {
+  case p if p.isInstanceOf[WriteToDataSourceV2Exec] =>
--- End diff --

this only works for microbatch mode, do we have a plan to support 
continuous mode?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

1 2 3 4 5 6 7 8 >

1 - 100 of 713 matches

Mail list logo