[GitHub] spark issue #22194: [SPARK-23932][SQL][FOLLOW-UP] Fix an example of zip_with...

2018-08-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22194
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22194: [SPARK-23932][SQL][FOLLOW-UP] Fix an example of zip_with...

2018-08-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22194
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95133/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22194: [SPARK-23932][SQL][FOLLOW-UP] Fix an example of zip_with...

2018-08-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22194
  
**[Test build #95133 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95133/testReport)**
 for PR 22194 at commit 
[`967360a`](https://github.com/apache/spark/commit/967360a1a417739cdada3b7c7334c8ca87ede6a6).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22165: [SPARK-25017][Core] Add test suite for BarrierCoordinato...

2018-08-22 Thread xuanyuanking
Github user xuanyuanking commented on the issue:

https://github.com/apache/spark/pull/22165
  
@jiangxb1987 Great thanks for your comment!
```
One general idea is that we don't need to rely on the RPC framework to test 
ContextBarrierState, just mock RpcCallContexts should be enough.
```
Actually I also want to implement like this at first also as you asked in 
jira, but `ContextBarrierState` is the private inner class in 
`BarrierCoordinator`. Could I do the refactor of moving `ContextBarrierState` 
out of `BarrierCoordinator`? If that is permitted I think we can just mock 
RpcCallContext to reach this.
```
We shall cover the following scenarios:
```
Pretty cool for the list, the 5 in front scenarios are including in 
currently implement, I'll add the last checking work of `Make sure we clear all 
the internal data under each case.` after we reach an agreement. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20146: [SPARK-11215][ML] Add multiple columns support to String...

2018-08-22 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/20146
  
seems like this was a thumbs-up from  @WeichenXu123 @jkbradley?
@dbtsai ?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22121: [SPARK-25133][SQL][Doc]Avro data source guide

2018-08-22 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/22121


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

2018-08-22 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22121
  
thanks, merging to master!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-22 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22112
  
To confirm, is everyone OK with merging this PR, or we are just OK with the 
direction and need more time to review this PR?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

2018-08-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22121
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95140/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

2018-08-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22121
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

2018-08-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22121
  
**[Test build #95140 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95140/testReport)**
 for PR 22121 at commit 
[`1f253bf`](https://github.com/apache/spark/commit/1f253bf536c3a7bd1c07ba5ea5600f661c8e106e).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

2018-08-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22121
  
**[Test build #95139 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95139/testReport)**
 for PR 22121 at commit 
[`8245806`](https://github.com/apache/spark/commit/824580684c05c2a3c1654517b77864ca5d504ee0).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

2018-08-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22121
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

2018-08-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22121
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95139/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22187: [SPARK-25178][SQL] Directly ship the StructType objects ...

2018-08-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22187
  
**[Test build #95141 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95141/testReport)**
 for PR 22187 at commit 
[`81ef75a`](https://github.com/apache/spark/commit/81ef75a36a1c9dcd6922d2ec77393bc35389efd0).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22187: [SPARK-25178][SQL] Directly ship the StructType objects ...

2018-08-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22187
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2475/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22187: [SPARK-25178][SQL] Directly ship the StructType objects ...

2018-08-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22187
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22163: [SPARK-25166][CORE]Reduce the number of write operations...

2018-08-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22163
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95130/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22163: [SPARK-25166][CORE]Reduce the number of write operations...

2018-08-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22163
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22163: [SPARK-25166][CORE]Reduce the number of write operations...

2018-08-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22163
  
**[Test build #95130 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95130/testReport)**
 for PR 22163 at commit 
[`f91e18c`](https://github.com/apache/spark/commit/f91e18c7d4b8eab53c4983320a0eab0403c37a48).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

2018-08-22 Thread gengliangwang
Github user gengliangwang commented on the issue:

https://github.com/apache/spark/pull/22121
  
The preview doc (zip file in PR description) is updated to latest version.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

2018-08-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22121
  
**[Test build #95140 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95140/testReport)**
 for PR 22121 at commit 
[`1f253bf`](https://github.com/apache/spark/commit/1f253bf536c3a7bd1c07ba5ea5600f661c8e106e).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-22 Thread mridulm
Github user mridulm commented on the issue:

https://github.com/apache/spark/pull/22112
  

@tgravescs:
> The shuffle simply transfers the bytes its supposed to. Sparks shuffle of 
those bytes is not consistent in that the order it fetches from can change and 
without the sort happening on that data the order can be different on rerun. I 
guess maybe you mean the ShuffledRDD as a whole or do you mean something else 
here?


By shuffle, I am referring to the output of shuffle which is be consumed by 
RDD with `ShuffleDependency` as input.
More specifically, the output of 
`SparkEnv.get.shuffleManager.getReader(...).read()` which RDD (user and spark 
impl's) uses to fetch output of shuffle machinery.
This output will not just be shuffle bytes/deserialize, but with 
aggregation applied (if specified) and ordering imposed (if specified).

ShuffledRDD is one such usage within spark core, but others exist within 
spark core and in user code.

> All I'm saying is zip is just another variant of this, you could document 
it as such and do nothing internal to spark to "fix it".

I agree; repartition + shuffle, zip, sample, mllib usages are all variants 
of the same problem - of shuffle output order being inconsistent.

> I guess we can separate out these 2 discussions. I think the point of 
this pr is to temporarily workaround the data loss/corruption issue with 
repartition by failing. So if everyone agrees on that lets move the discussion 
to a jira about what to do with the rest of the operators and fix repartition 
here. thoughts?

Sounds good to me.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

2018-08-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22121
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

2018-08-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22121
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2474/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

2018-08-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22121
  
**[Test build #95139 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95139/testReport)**
 for PR 22121 at commit 
[`8245806`](https://github.com/apache/spark/commit/824580684c05c2a3c1654517b77864ca5d504ee0).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22192: [SPARK-24918][Core] Executor Plugin API

2018-08-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22192
  
**[Test build #95138 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95138/testReport)**
 for PR 22192 at commit 
[`7c86fc5`](https://github.com/apache/spark/commit/7c86fc54c36954f1345eccc066873f7f90832657).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

2018-08-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22121
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2473/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

2018-08-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22121
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22112
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22112
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95129/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22152: [SPARK-25159][SQL] json schema inference should o...

2018-08-22 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22152#discussion_r212183703
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala
 ---
@@ -69,10 +70,17 @@ private[sql] object JsonInferSchema {
   }.reduceOption(typeMerger).toIterator
 }
 
-// Here we get RDD local iterator then fold, instead of calling 
`RDD.fold` directly, because
-// `RDD.fold` will run the fold function in DAGScheduler event loop 
thread, which may not have
-// active SparkSession and `SQLConf.get` may point to the wrong 
configs.
-val rootType = 
mergedTypesFromPartitions.toLocalIterator.fold(StructType(Nil))(typeMerger)
+// Here we manually submit a fold-like Spark job, so that we can set 
the SQLConf when running
+// the fold functions in the scheduler event loop thread.
+val existingConf = SQLConf.get
+var rootType: DataType = StructType(Nil)
+val foldPartition = (iter: Iterator[DataType]) => 
iter.fold(StructType(Nil))(typeMerger)
+val mergeResult = (index: Int, taskResult: DataType) => {
+  rootType = SQLConf.withExistingConf(existingConf) {
--- End diff --

Same question was in my mind. thanks for clarification.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22112
  
**[Test build #95129 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95129/testReport)**
 for PR 22112 at commit 
[`93f37fa`](https://github.com/apache/spark/commit/93f37fa585462b9ee2fb9e179eab736fbc416d3e).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22171: [SPARK-25177][SQL] When dataframe decimal type co...

2018-08-22 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22171#discussion_r212180992
  
--- Diff: sql/core/src/test/resources/sql-tests/results/literals.sql.out ---
@@ -197,7 +197,7 @@ select .e3
 -- !query 20
 select 1E309, -1E309
 -- !query 20 schema
-struct<1E+309:decimal(1,-309),-1E+309:decimal(1,-309)>

+struct<10:decimal(1,-309),-10:decimal(1,-309)>
--- End diff --

@vinodkc how does it show in Postgres?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20345: [SPARK-23172][SQL] Expand the ReorderJoin rule to handle...

2018-08-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20345
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95131/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20345: [SPARK-23172][SQL] Expand the ReorderJoin rule to handle...

2018-08-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20345
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20345: [SPARK-23172][SQL] Expand the ReorderJoin rule to handle...

2018-08-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20345
  
**[Test build #95131 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95131/testReport)**
 for PR 20345 at commit 
[`39462fb`](https://github.com/apache/spark/commit/39462fbee952ec574b4c04d7718fd73bb5f56d9d).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21546: [SPARK-23030][SQL][PYTHON] Use Arrow stream format for c...

2018-08-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21546
  
**[Test build #95137 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95137/testReport)**
 for PR 21546 at commit 
[`5549644`](https://github.com/apache/spark/commit/554964465dbcb99cc313620fafb0fc41acfd4304).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22153: [SPARK-23034][SQL] Show RDD/relation names in RDD/Hive t...

2018-08-22 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22153
  
LGTM


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21546: [SPARK-23030][SQL][PYTHON] Use Arrow stream format for c...

2018-08-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21546
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2472/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21546: [SPARK-23030][SQL][PYTHON] Use Arrow stream format for c...

2018-08-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21546
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22157: [SPARK-25126][SQL] Avoid creating Reader for all ...

2018-08-22 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22157#discussion_r212178321
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcQuerySuite.scala
 ---
@@ -562,20 +562,57 @@ abstract class OrcQueryTest extends OrcTest {
   }
 }
 
+def testAllCorruptFiles(): Unit = {
+  withTempDir { dir =>
+val basePath = dir.getCanonicalPath
+spark.range(1).toDF("a").write.json(new Path(basePath, 
"first").toString)
+spark.range(1, 2).toDF("a").write.json(new Path(basePath, 
"second").toString)
+val df = spark.read.orc(
+  new Path(basePath, "first").toString,
+  new Path(basePath, "second").toString)
+assert(df.count() == 0)
+  }
+}
+
+def testAllCorruptFilesWithoutSchemaInfer(): Unit = {
+  withTempDir { dir =>
+val basePath = dir.getCanonicalPath
+spark.range(1).toDF("a").write.json(new Path(basePath, 
"first").toString)
+spark.range(1, 2).toDF("a").write.json(new Path(basePath, 
"second").toString)
+val df = spark.read.schema("a long").orc(
+  new Path(basePath, "first").toString,
+  new Path(basePath, "second").toString)
+assert(df.count() == 0)
+  }
+}
+
 withSQLConf(SQLConf.IGNORE_CORRUPT_FILES.key -> "true") {
   testIgnoreCorruptFiles()
   testIgnoreCorruptFilesWithoutSchemaInfer()
+  val m1 = intercept[AnalysisException] {
+testAllCorruptFiles()
+  }.getMessage
+  assert(m1.contains("Unable to infer schema for ORC"))
+  testAllCorruptFilesWithoutSchemaInfer()
 }
 
 withSQLConf(SQLConf.IGNORE_CORRUPT_FILES.key -> "false") {
   val m1 = intercept[SparkException] {
 testIgnoreCorruptFiles()
   }.getMessage
-  assert(m1.contains("Could not read footer for file"))
+  assert(m1.contains("Malformed ORC file"))
--- End diff --

why the error message changed?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21546: [SPARK-23030][SQL][PYTHON] Use Arrow stream format for c...

2018-08-22 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/21546
  
LGTM


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21546: [SPARK-23030][SQL][PYTHON] Use Arrow stream forma...

2018-08-22 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21546#discussion_r212178291
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowConverters.scala
 ---
@@ -183,34 +178,106 @@ private[sql] object ArrowConverters {
   }
 
   /**
-   * Convert a byte array to an ArrowRecordBatch.
+   * Load a serialized ArrowRecordBatch.
*/
-  private[arrow] def byteArrayToBatch(
+  private[arrow] def loadBatch(
   batchBytes: Array[Byte],
   allocator: BufferAllocator): ArrowRecordBatch = {
-val in = new ByteArrayReadableSeekableByteChannel(batchBytes)
-val reader = new ArrowFileReader(in, allocator)
-
-// Read a batch from a byte stream, ensure the reader is closed
-Utils.tryWithSafeFinally {
-  val root = reader.getVectorSchemaRoot  // throws IOException
-  val unloader = new VectorUnloader(root)
-  reader.loadNextBatch()  // throws IOException
-  unloader.getRecordBatch
-} {
-  reader.close()
-}
+val in = new ByteArrayInputStream(batchBytes)
+MessageSerializer.deserializeRecordBatch(
+  new ReadChannel(Channels.newChannel(in)), allocator)  // throws 
IOException
   }
 
+  /**
+   * Create a DataFrame from a JavaRDD of serialized ArrowRecordBatches.
+   */
   private[sql] def toDataFrame(
-  payloadRDD: JavaRDD[Array[Byte]],
+  arrowBatchRDD: JavaRDD[Array[Byte]],
   schemaString: String,
   sqlContext: SQLContext): DataFrame = {
-val rdd = payloadRDD.rdd.mapPartitions { iter =>
+val schema = DataType.fromJson(schemaString).asInstanceOf[StructType]
+val timeZoneId = sqlContext.sessionState.conf.sessionLocalTimeZone
+val rdd = arrowBatchRDD.rdd.mapPartitions { iter =>
   val context = TaskContext.get()
-  ArrowConverters.fromPayloadIterator(iter.map(new ArrowPayload(_)), 
context)
+  ArrowConverters.fromBatchIterator(iter, schema, timeZoneId, context)
 }
-val schema = DataType.fromJson(schemaString).asInstanceOf[StructType]
 sqlContext.internalCreateDataFrame(rdd, schema)
   }
+
+  /**
+   * Read a file as an Arrow stream and parallelize as an RDD of 
serialized ArrowRecordBatches.
+   */
+  private[sql] def readArrowStreamFromFile(
+  sqlContext: SQLContext,
+  filename: String): JavaRDD[Array[Byte]] = {
+val fileStream = new FileInputStream(filename)
+try {
+  // Create array so that we can safely close the file
+  val batches = getBatchesFromStream(fileStream.getChannel).toArray
+  // Parallelize the record batches to create an RDD
+  JavaRDD.fromRDD(sqlContext.sparkContext.parallelize(batches, 
batches.length))
--- End diff --

Ah, sorry. You are right. I misread.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22192: [SPARK-24918][Core] Executor Plugin API

2018-08-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22192
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22192: [SPARK-24918][Core] Executor Plugin API

2018-08-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22192
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95136/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22192: [SPARK-24918][Core] Executor Plugin API

2018-08-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22192
  
**[Test build #95136 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95136/testReport)**
 for PR 22192 at commit 
[`44454dd`](https://github.com/apache/spark/commit/44454dd586e35bdf16492c4a8969494bd3b7f8f5).
 * This patch **fails Java style tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `  .doc(\"Comma-separated list of class names for \"plugins\" 
implementing \" +`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22112
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95128/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22112
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22112
  
**[Test build #95128 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95128/testReport)**
 for PR 22112 at commit 
[`097092b`](https://github.com/apache/spark/commit/097092be4b2967689082af62715ecc4f78086c30).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22192: [SPARK-24918] Executor Plugin API

2018-08-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22192
  
**[Test build #95136 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95136/testReport)**
 for PR 22192 at commit 
[`44454dd`](https://github.com/apache/spark/commit/44454dd586e35bdf16492c4a8969494bd3b7f8f5).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22192: [SPARK-24918] Executor Plugin API

2018-08-22 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/22192
  
Jenkins, ok to test


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21923: [SPARK-24918][Core] Executor Plugin api

2018-08-22 Thread squito
Github user squito commented on the issue:

https://github.com/apache/spark/pull/21923
  
this is being continued in https://github.com/apache/spark/pull/22192


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21923: [SPARK-24918][Core] Executor Plugin api

2018-08-22 Thread squito
Github user squito closed the pull request at:

https://github.com/apache/spark/pull/21923


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22195: [SPARK-25205][CORE] Fix typo in spark.network.crypto.key...

2018-08-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22195
  
**[Test build #95135 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95135/testReport)**
 for PR 22195 at commit 
[`b927b94`](https://github.com/apache/spark/commit/b927b94ea1312ba74d73c203adb7683b2fb42fed).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21546: [SPARK-23030][SQL][PYTHON] Use Arrow stream forma...

2018-08-22 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/21546#discussion_r212171980
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowConverters.scala
 ---
@@ -183,34 +178,106 @@ private[sql] object ArrowConverters {
   }
 
   /**
-   * Convert a byte array to an ArrowRecordBatch.
+   * Load a serialized ArrowRecordBatch.
*/
-  private[arrow] def byteArrayToBatch(
+  private[arrow] def loadBatch(
   batchBytes: Array[Byte],
   allocator: BufferAllocator): ArrowRecordBatch = {
-val in = new ByteArrayReadableSeekableByteChannel(batchBytes)
-val reader = new ArrowFileReader(in, allocator)
-
-// Read a batch from a byte stream, ensure the reader is closed
-Utils.tryWithSafeFinally {
-  val root = reader.getVectorSchemaRoot  // throws IOException
-  val unloader = new VectorUnloader(root)
-  reader.loadNextBatch()  // throws IOException
-  unloader.getRecordBatch
-} {
-  reader.close()
-}
+val in = new ByteArrayInputStream(batchBytes)
+MessageSerializer.deserializeRecordBatch(
+  new ReadChannel(Channels.newChannel(in)), allocator)  // throws 
IOException
   }
 
+  /**
+   * Create a DataFrame from a JavaRDD of serialized ArrowRecordBatches.
+   */
   private[sql] def toDataFrame(
-  payloadRDD: JavaRDD[Array[Byte]],
+  arrowBatchRDD: JavaRDD[Array[Byte]],
   schemaString: String,
   sqlContext: SQLContext): DataFrame = {
-val rdd = payloadRDD.rdd.mapPartitions { iter =>
+val schema = DataType.fromJson(schemaString).asInstanceOf[StructType]
+val timeZoneId = sqlContext.sessionState.conf.sessionLocalTimeZone
+val rdd = arrowBatchRDD.rdd.mapPartitions { iter =>
   val context = TaskContext.get()
-  ArrowConverters.fromPayloadIterator(iter.map(new ArrowPayload(_)), 
context)
+  ArrowConverters.fromBatchIterator(iter, schema, timeZoneId, context)
 }
-val schema = DataType.fromJson(schemaString).asInstanceOf[StructType]
 sqlContext.internalCreateDataFrame(rdd, schema)
   }
+
+  /**
+   * Read a file as an Arrow stream and parallelize as an RDD of 
serialized ArrowRecordBatches.
+   */
+  private[sql] def readArrowStreamFromFile(
+  sqlContext: SQLContext,
+  filename: String): JavaRDD[Array[Byte]] = {
+val fileStream = new FileInputStream(filename)
+try {
+  // Create array so that we can safely close the file
+  val batches = getBatchesFromStream(fileStream.getChannel).toArray
+  // Parallelize the record batches to create an RDD
+  JavaRDD.fromRDD(sqlContext.sparkContext.parallelize(batches, 
batches.length))
--- End diff --

so this the length of the array of batches, not the number of records in 
the batch.  The input is split according to the default parallelism config.  So 
if that is 32, we will have an array of 32 batches and then parallelize those 
to 32 partitions. `parallelize` might usually have one big array of primitives 
as the first arg, that you then partition by the number in the second arg, but 
this is a little different since we are using batches. Does that answer your 
question?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22195: [SPARK-25205][CORE] Fix typo in spark.network.crypto.key...

2018-08-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22195
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2471/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22195: [SPARK-25205][CORE] Fix typo in spark.network.crypto.key...

2018-08-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22195
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22161: [SPARK-25167][SPARKR][TEST][MINOR] Minor fixes for R sql...

2018-08-22 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22161
  
Ah, it's okie. Yes, please. Not a big deal.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22195: [CORE] Fix typo in spark.network.crypto.keyFactor...

2018-08-22 Thread squito
GitHub user squito opened a pull request:

https://github.com/apache/spark/pull/22195

[CORE] Fix typo in spark.network.crypto.keyFactoryIterations



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/squito/spark SPARK-25205

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22195.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22195


commit b927b94ea1312ba74d73c203adb7683b2fb42fed
Author: Imran Rashid 
Date:   2018-08-23T03:11:40Z

[CORE] Fix typo in spark.network.crypto.keyFactoryIterations




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22194: [SPARK-23932][SQL][FOLLOW-UP] Fix an example of zip_with...

2018-08-22 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22194
  
Merged to master.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21546: [SPARK-23030][SQL][PYTHON] Use Arrow stream forma...

2018-08-22 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/21546#discussion_r212170997
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -3268,13 +3268,49 @@ class Dataset[T] private[sql](
   }
 
   /**
-   * Collect a Dataset as ArrowPayload byte arrays and serve to PySpark.
+   * Collect a Dataset as Arrow batches and serve stream to PySpark.
*/
   private[sql] def collectAsArrowToPython(): Array[Any] = {
+val timeZoneId = sparkSession.sessionState.conf.sessionLocalTimeZone
+
 withAction("collectAsArrowToPython", queryExecution) { plan =>
-  val iter: Iterator[Array[Byte]] =
-toArrowPayload(plan).collect().iterator.map(_.asPythonSerializable)
-  PythonRDD.serveIterator(iter, "serve-Arrow")
+  PythonRDD.serveToStream("serve-Arrow") { out =>
+val batchWriter = new ArrowBatchStreamWriter(schema, out, 
timeZoneId)
+val arrowBatchRdd = toArrowBatchRdd(plan)
+val numPartitions = arrowBatchRdd.partitions.length
+
+// Store collection results for worst case of 1 to N-1 partitions
--- End diff --

It's not necessary to buffer the first partition because it can be sent to 
Python right away, so only need an array of size N-1


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21546: [SPARK-23030][SQL][PYTHON] Use Arrow stream forma...

2018-08-22 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/21546#discussion_r212171051
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowConverters.scala
 ---
@@ -183,34 +178,106 @@ private[sql] object ArrowConverters {
   }
 
   /**
-   * Convert a byte array to an ArrowRecordBatch.
+   * Load a serialized ArrowRecordBatch.
*/
-  private[arrow] def byteArrayToBatch(
+  private[arrow] def loadBatch(
   batchBytes: Array[Byte],
   allocator: BufferAllocator): ArrowRecordBatch = {
-val in = new ByteArrayReadableSeekableByteChannel(batchBytes)
-val reader = new ArrowFileReader(in, allocator)
-
-// Read a batch from a byte stream, ensure the reader is closed
-Utils.tryWithSafeFinally {
-  val root = reader.getVectorSchemaRoot  // throws IOException
-  val unloader = new VectorUnloader(root)
-  reader.loadNextBatch()  // throws IOException
-  unloader.getRecordBatch
-} {
-  reader.close()
-}
+val in = new ByteArrayInputStream(batchBytes)
+MessageSerializer.deserializeRecordBatch(
+  new ReadChannel(Channels.newChannel(in)), allocator)  // throws 
IOException
   }
 
+  /**
+   * Create a DataFrame from a JavaRDD of serialized ArrowRecordBatches.
+   */
   private[sql] def toDataFrame(
-  payloadRDD: JavaRDD[Array[Byte]],
+  arrowBatchRDD: JavaRDD[Array[Byte]],
   schemaString: String,
   sqlContext: SQLContext): DataFrame = {
-val rdd = payloadRDD.rdd.mapPartitions { iter =>
+val schema = DataType.fromJson(schemaString).asInstanceOf[StructType]
+val timeZoneId = sqlContext.sessionState.conf.sessionLocalTimeZone
+val rdd = arrowBatchRDD.rdd.mapPartitions { iter =>
   val context = TaskContext.get()
-  ArrowConverters.fromPayloadIterator(iter.map(new ArrowPayload(_)), 
context)
+  ArrowConverters.fromBatchIterator(iter, schema, timeZoneId, context)
 }
-val schema = DataType.fromJson(schemaString).asInstanceOf[StructType]
 sqlContext.internalCreateDataFrame(rdd, schema)
   }
+
+  /**
+   * Read a file as an Arrow stream and parallelize as an RDD of 
serialized ArrowRecordBatches.
+   */
+  private[sql] def readArrowStreamFromFile(
+  sqlContext: SQLContext,
+  filename: String): JavaRDD[Array[Byte]] = {
+val fileStream = new FileInputStream(filename)
--- End diff --

yup, thanks for catching that


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21546: [SPARK-23030][SQL][PYTHON] Use Arrow stream forma...

2018-08-22 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/21546#discussion_r212170606
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowConverters.scala
 ---
@@ -183,34 +178,106 @@ private[sql] object ArrowConverters {
   }
 
   /**
-   * Convert a byte array to an ArrowRecordBatch.
+   * Load a serialized ArrowRecordBatch.
*/
-  private[arrow] def byteArrayToBatch(
+  private[arrow] def loadBatch(
   batchBytes: Array[Byte],
   allocator: BufferAllocator): ArrowRecordBatch = {
-val in = new ByteArrayReadableSeekableByteChannel(batchBytes)
-val reader = new ArrowFileReader(in, allocator)
-
-// Read a batch from a byte stream, ensure the reader is closed
-Utils.tryWithSafeFinally {
-  val root = reader.getVectorSchemaRoot  // throws IOException
-  val unloader = new VectorUnloader(root)
-  reader.loadNextBatch()  // throws IOException
-  unloader.getRecordBatch
-} {
-  reader.close()
-}
+val in = new ByteArrayInputStream(batchBytes)
+MessageSerializer.deserializeRecordBatch(
+  new ReadChannel(Channels.newChannel(in)), allocator)  // throws 
IOException
   }
 
+  /**
+   * Create a DataFrame from a JavaRDD of serialized ArrowRecordBatches.
+   */
   private[sql] def toDataFrame(
-  payloadRDD: JavaRDD[Array[Byte]],
+  arrowBatchRDD: JavaRDD[Array[Byte]],
   schemaString: String,
   sqlContext: SQLContext): DataFrame = {
-val rdd = payloadRDD.rdd.mapPartitions { iter =>
+val schema = DataType.fromJson(schemaString).asInstanceOf[StructType]
+val timeZoneId = sqlContext.sessionState.conf.sessionLocalTimeZone
+val rdd = arrowBatchRDD.rdd.mapPartitions { iter =>
   val context = TaskContext.get()
-  ArrowConverters.fromPayloadIterator(iter.map(new ArrowPayload(_)), 
context)
+  ArrowConverters.fromBatchIterator(iter, schema, timeZoneId, context)
 }
-val schema = DataType.fromJson(schemaString).asInstanceOf[StructType]
 sqlContext.internalCreateDataFrame(rdd, schema)
   }
+
+  /**
+   * Read a file as an Arrow stream and parallelize as an RDD of 
serialized ArrowRecordBatches.
+   */
+  private[sql] def readArrowStreamFromFile(
+  sqlContext: SQLContext,
+  filename: String): JavaRDD[Array[Byte]] = {
+val fileStream = new FileInputStream(filename)
+try {
+  // Create array so that we can safely close the file
+  val batches = getBatchesFromStream(fileStream.getChannel).toArray
+  // Parallelize the record batches to create an RDD
+  JavaRDD.fromRDD(sqlContext.sparkContext.parallelize(batches, 
batches.length))
+} finally {
+  fileStream.close()
+}
+  }
+
+  /**
+   * Read an Arrow stream input and return an iterator of serialized 
ArrowRecordBatches.
+   */
+  private[sql] def getBatchesFromStream(in: SeekableByteChannel): 
Iterator[Array[Byte]] = {
+
+// Create an iterator to get each serialized ArrowRecordBatch from a 
stream
+new Iterator[Array[Byte]] {
+  var batch: Array[Byte] = readNextBatch()
+
+  override def hasNext: Boolean = batch != null
+
+  override def next(): Array[Byte] = {
+val prevBatch = batch
+batch = readNextBatch()
+prevBatch
+  }
+
+  def readNextBatch(): Array[Byte] = {
+val msgMetadata = MessageSerializer.readMessage(new 
ReadChannel(in))
+if (msgMetadata == null) {
+  return null
+}
+
+// Get the length of the body, which has not be read at this point
+val bodyLength = msgMetadata.getMessageBodyLength.toInt
+
+// Only care about RecordBatch data, skip Schema and unsupported 
Dictionary messages
+if (msgMetadata.getMessage.headerType() == 
MessageHeader.RecordBatch) {
+
+  // Create output backed by buffer to hold msg length (int32), 
msg metadata, msg body
+  val bbout = new ByteBufferOutputStream(4 + 
msgMetadata.getMessageLength + bodyLength)
--- End diff --

I'll add some more details about what this is doing


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22194: [SPARK-23932][SQL][FOLLOW-UP] Fix an example of zip_with...

2018-08-22 Thread techaddict
Github user techaddict commented on the issue:

https://github.com/apache/spark/pull/22194
  
@ueshin LGTM


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22161: [SPARK-25167][SPARKR][TEST][MINOR] Minor fixes fo...

2018-08-22 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/22161


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22161: [SPARK-25167][SPARKR][TEST][MINOR] Minor fixes for R sql...

2018-08-22 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22161
  
@HyukjinKwon Oh.. thank you. I was going to fix the style ? I will include 
it when i fix something next ?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22161: [SPARK-25167][SPARKR][TEST][MINOR] Minor fixes for R sql...

2018-08-22 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22161
  
LGTM. 

Merged to master.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22161: [SPARK-25167][SPARKR][TEST][MINOR] Minor fixes fo...

2018-08-22 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22161#discussion_r212168906
  
--- Diff: R/pkg/tests/fulltests/test_sparkSQL.R ---
@@ -3613,11 +3613,11 @@ test_that("Collect on DataFrame when NAs exists at 
the top of a timestamp column
 test_that("catalog APIs, currentDatabase, setCurrentDatabase, 
listDatabases", {
   expect_equal(currentDatabase(), "default")
   expect_error(setCurrentDatabase("default"), NA)
-  expect_error(setCurrentDatabase("foo"),
-   "Error in setCurrentDatabase : analysis error - Database 
'foo' does not exist")
+  expect_error(setCurrentDatabase("zxwtyswklpf"),
+"Error in setCurrentDatabase : analysis error - Database 
'zxwtyswklpf' does not exist")
--- End diff --

Not a big deal but can we fix this style too?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22085: [SPARK-25095][PySpark] Python support for BarrierTaskCon...

2018-08-22 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22085
  
Thanks, @jiangxb1987 and @mengxr again.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22085: [SPARK-25095][PySpark] Python support for Barrier...

2018-08-22 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22085#discussion_r212168597
  
--- Diff: python/pyspark/taskcontext.py ---
@@ -95,3 +99,143 @@ def getLocalProperty(self, key):
 Get a local property set upstream in the driver, or None if it is 
missing.
 """
 return self._localProperties.get(key, None)
+
+
+BARRIER_FUNCTION = 1
+
+
+def _load_from_socket(port, auth_secret):
+"""
+Load data from a given socket, this is a blocking method thus only 
return when the socket
+connection has been closed.
+
+This is copied from context.py, while modified the message protocol.
--- End diff --

It would be nicer if we can deduplciate it later.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22163: [SPARK-25166][CORE]Reduce the number of write ope...

2018-08-22 Thread 10110346
Github user 10110346 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22163#discussion_r212168161
  
--- Diff: 
core/src/main/java/org/apache/spark/shuffle/sort/ShuffleExternalSorter.java ---
@@ -206,14 +211,21 @@ private void writeSortedFile(boolean isLastFile) {
   long recordReadPosition = recordOffsetInPage + uaoSize; // skip over 
record length
   while (dataRemaining > 0) {
 final int toTransfer = Math.min(diskWriteBufferSize, 
dataRemaining);
-Platform.copyMemory(
-  recordPage, recordReadPosition, writeBuffer, 
Platform.BYTE_ARRAY_OFFSET, toTransfer);
-writer.write(writeBuffer, 0, toTransfer);
+if (bufferOffset > 0 && bufferOffset + toTransfer > 
DISK_WRITE_BUFFER_SIZE) {
--- End diff --

Yes,you are right.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22189: Correct missing punctuation in the documentation

2018-08-22 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22189
  
It's okay but mind if I ask to take another look, see if there are more 
typos and fix other typos while we are here? I am pretty sure there are more.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22163: [SPARK-25166][CORE]Reduce the number of write ope...

2018-08-22 Thread Ngone51
Github user Ngone51 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22163#discussion_r212167438
  
--- Diff: 
core/src/main/java/org/apache/spark/shuffle/sort/ShuffleExternalSorter.java ---
@@ -206,14 +211,21 @@ private void writeSortedFile(boolean isLastFile) {
   long recordReadPosition = recordOffsetInPage + uaoSize; // skip over 
record length
   while (dataRemaining > 0) {
 final int toTransfer = Math.min(diskWriteBufferSize, 
dataRemaining);
-Platform.copyMemory(
-  recordPage, recordReadPosition, writeBuffer, 
Platform.BYTE_ARRAY_OFFSET, toTransfer);
-writer.write(writeBuffer, 0, toTransfer);
+if (bufferOffset > 0 && bufferOffset + toTransfer > 
DISK_WRITE_BUFFER_SIZE) {
--- End diff --

> The numRecordsWritten in DiskBlockObjectWriter is still correct during 
the process after this PR

The number is correct, but it is not consistent with what real happen 
compare to current behaviour.  But as you said, we will get correct result at 
the end. So, it may not be a big deal.  


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22191: [SPARK-25204][SS] Fix race in rate source test.

2018-08-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22191
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95126/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22191: [SPARK-25204][SS] Fix race in rate source test.

2018-08-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22191
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22191: [SPARK-25204][SS] Fix race in rate source test.

2018-08-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22191
  
**[Test build #95126 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95126/testReport)**
 for PR 22191 at commit 
[`ade656a`](https://github.com/apache/spark/commit/ade656a3cf3edb32d80e66d03276584c0c86c4d0).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22163: [SPARK-25166][CORE]Reduce the number of write ope...

2018-08-22 Thread 10110346
Github user 10110346 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22163#discussion_r212166322
  
--- Diff: 
core/src/main/java/org/apache/spark/shuffle/sort/ShuffleExternalSorter.java ---
@@ -206,14 +211,21 @@ private void writeSortedFile(boolean isLastFile) {
   long recordReadPosition = recordOffsetInPage + uaoSize; // skip over 
record length
   while (dataRemaining > 0) {
 final int toTransfer = Math.min(diskWriteBufferSize, 
dataRemaining);
-Platform.copyMemory(
-  recordPage, recordReadPosition, writeBuffer, 
Platform.BYTE_ARRAY_OFFSET, toTransfer);
-writer.write(writeBuffer, 0, toTransfer);
+if (bufferOffset > 0 && bufferOffset + toTransfer > 
DISK_WRITE_BUFFER_SIZE) {
--- End diff --

The `numRecordsWritten`  in `DiskBlockObjectWriter`  is still  correct 
during the process after this PR


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22187: [SPARK-25178][SQL] Directly ship the StructType objects ...

2018-08-22 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/22187
  
+1 LGTM


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22171: [SPARK-25177][SQL] When dataframe decimal type co...

2018-08-22 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/22171#discussion_r212165658
  
--- Diff: sql/core/src/test/resources/sql-tests/results/literals.sql.out ---
@@ -197,7 +197,7 @@ select .e3
 -- !query 20
 select 1E309, -1E309
 -- !query 20 schema
-struct<1E+309:decimal(1,-309),-1E+309:decimal(1,-309)>

+struct<10:decimal(1,-309),-10:decimal(1,-309)>
--- End diff --

hmm, this seems a bad representation.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22171: [SPARK-25177][SQL] When dataframe decimal type co...

2018-08-22 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/22171#discussion_r212165521
  
--- Diff: 
sql/core/src/test/resources/sql-tests/results/higher-order-functions.sql.out ---
@@ -201,6 +201,7 @@ struct<>
 -- !query 20 output
 
 
+
--- End diff --

I think this is wrongly submitted.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22163: [SPARK-25166][CORE]Reduce the number of write ope...

2018-08-22 Thread 10110346
Github user 10110346 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22163#discussion_r212165385
  
--- Diff: 
core/src/main/java/org/apache/spark/shuffle/sort/ShuffleExternalSorter.java ---
@@ -206,14 +211,21 @@ private void writeSortedFile(boolean isLastFile) {
   long recordReadPosition = recordOffsetInPage + uaoSize; // skip over 
record length
   while (dataRemaining > 0) {
 final int toTransfer = Math.min(diskWriteBufferSize, 
dataRemaining);
-Platform.copyMemory(
-  recordPage, recordReadPosition, writeBuffer, 
Platform.BYTE_ARRAY_OFFSET, toTransfer);
-writer.write(writeBuffer, 0, toTransfer);
+if (bufferOffset > 0 && bufferOffset + toTransfer > 
DISK_WRITE_BUFFER_SIZE) {
--- End diff --

Yeah,   this result of `_bytesWritten` would not have been updated 
synchronously before, you can see this condition:`if (numRecordsWritten % 16384 
== 0)`.
 But we do not  need worry. the final result is correct, because it will be 
updated in `commitAndGet`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22153: [SPARK-23034][SQL] Show RDD/relation names in RDD/Hive t...

2018-08-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22153
  
**[Test build #95134 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95134/testReport)**
 for PR 22153 at commit 
[`82190ac`](https://github.com/apache/spark/commit/82190accc6182988dfae00a07a656b069aa7b708).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22153: [SPARK-23034][SQL] Show RDD/relation names in RDD/Hive t...

2018-08-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22153
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2470/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22153: [SPARK-23034][SQL] Show RDD/relation names in RDD/Hive t...

2018-08-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22153
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22180: [SPARK-25174][YARN]Limit the size of diagnostic message ...

2018-08-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22180
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95132/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22180: [SPARK-25174][YARN]Limit the size of diagnostic message ...

2018-08-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22180
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22180: [SPARK-25174][YARN]Limit the size of diagnostic message ...

2018-08-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22180
  
**[Test build #95132 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95132/testReport)**
 for PR 22180 at commit 
[`3271c3f`](https://github.com/apache/spark/commit/3271c3f4100e0d69fe30400a42ab35aaab1c7c48).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22187: [SPARK-25178][SQL] Directly ship the StructType o...

2018-08-22 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/22187#discussion_r212164362
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/RowBasedHashMapGenerator.scala
 ---
@@ -44,31 +44,8 @@ class RowBasedHashMapGenerator(
 groupingKeySchema, bufferSchema) {
 
   override protected def initializeAggregateHashMap(): String = {
-val generatedKeySchema: String =
-  s"new org.apache.spark.sql.types.StructType()" +
-groupingKeySchema.map { key =>
-  val keyName = ctx.addReferenceObj("keyName", key.name)
-  key.dataType match {
-case d: DecimalType =>
-  s""".add($keyName, 
org.apache.spark.sql.types.DataTypes.createDecimalType(
-  |${d.precision}, ${d.scale}))""".stripMargin
-case _ =>
-  s""".add($keyName, 
org.apache.spark.sql.types.DataTypes.${key.dataType})"""
-  }
-}.mkString("\n").concat(";")
-
-val generatedValueSchema: String =
-  s"new org.apache.spark.sql.types.StructType()" +
-bufferSchema.map { key =>
-  val keyName = ctx.addReferenceObj("keyName", key.name)
-  key.dataType match {
-case d: DecimalType =>
-  s""".add($keyName, 
org.apache.spark.sql.types.DataTypes.createDecimalType(
-  |${d.precision}, ${d.scale}))""".stripMargin
-case _ =>
-  s""".add($keyName, 
org.apache.spark.sql.types.DataTypes.${key.dataType})"""
-  }
-}.mkString("\n").concat(";")
+val generatedKeySchema = ctx.addReferenceObj("generatedKeySchemaTerm", 
groupingKeySchema)
--- End diff --

nit: the variable name sounds strange because the schema is not generated 
any more.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22171: [SPARK-25177][SQL] When dataframe decimal type column ha...

2018-08-22 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22171
  
@rxin, I recall https://github.com/apache/spark/pull/14560 where we used 
Postgres as reference. WDYT?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22194: [SPARK-23932][SQL][FOLLOW-UP] Fix an example of zip_with...

2018-08-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22194
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22194: [SPARK-23932][SQL][FOLLOW-UP] Fix an example of zip_with...

2018-08-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22194
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2469/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22153: [SPARK-23034][SQL] Show RDD/relation names in RDD/In-Mem...

2018-08-22 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/22153
  
my bad, this pr doesn't affect cache tables in webui. I'll drop these.
Actually, this affects hive tables and rdds only;
```
>> Hive table case
sql("CREATE TABLE t(c1 int) USING hive")
sql("INSERT INTO t VALUES(1)")
spark.table("t").show()

>> RDD case
import org.apache.spark.sql._
import org.apache.spark.sql.types._
val data = spark.sparkContext.parallelize(Row(1, "abc") :: 
Nil).setName("existing RDD1")
val df = spark.createDataFrame(data, StructType.fromDDL("c0 int, c1 
string"))
df.show()
```
> spark-v2.3.1 for hive tables
https://user-images.githubusercontent.com/692303/44500677-cb55d180-a6c4-11e8-97e9-25b88b351b0a.png;>

> master w/this pr for hive tables
https://user-images.githubusercontent.com/692303/44500676-cb55d180-a6c4-11e8-9602-1cfbea6d8267.png;>

> spark-v2.3.1 for rdds
https://user-images.githubusercontent.com/692303/44500731-05bf6e80-a6c5-11e8-83dd-ed7f1ab2d658.png;>

> master w/this pr for rdds
https://user-images.githubusercontent.com/692303/44500741-11ab3080-a6c5-11e8-8c18-e1cc66be0f09.png;>



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22194: [SPARK-23932][SQL][FOLLOW-UP] Fix an example of zip_with...

2018-08-22 Thread ueshin
Github user ueshin commented on the issue:

https://github.com/apache/spark/pull/22194
  
cc @techaddict


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22194: [SPARK-23932][SQL][FOLLOW-UP] Fix an example of zip_with...

2018-08-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22194
  
**[Test build #95133 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95133/testReport)**
 for PR 22194 at commit 
[`967360a`](https://github.com/apache/spark/commit/967360a1a417739cdada3b7c7334c8ca87ede6a6).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22185: [SPARK-25127] DataSourceV2: Remove SupportsPushDownCatal...

2018-08-22 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22185
  
+1


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22194: [SPARK-23932][SQL][FOLLOW-UP] Fix an example of z...

2018-08-22 Thread ueshin
GitHub user ueshin opened a pull request:

https://github.com/apache/spark/pull/22194

[SPARK-23932][SQL][FOLLOW-UP] Fix an example of zip_with function.

## What changes were proposed in this pull request?

This is a follow-up pr of #22031 which added `zip_with` function to fix an 
example.

## How was this patch tested?

Existing tests.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ueshin/apache-spark 
issues/SPARK-23932/fix_examples

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22194.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22194


commit 967360a1a417739cdada3b7c7334c8ca87ede6a6
Author: Takuya UESHIN 
Date:   2018-08-23T01:59:55Z

Fix an example.




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22163: [SPARK-25166][CORE]Reduce the number of write ope...

2018-08-22 Thread Ngone51
Github user Ngone51 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22163#discussion_r212163785
  
--- Diff: 
core/src/main/java/org/apache/spark/shuffle/sort/ShuffleExternalSorter.java ---
@@ -206,14 +211,21 @@ private void writeSortedFile(boolean isLastFile) {
   long recordReadPosition = recordOffsetInPage + uaoSize; // skip over 
record length
   while (dataRemaining > 0) {
 final int toTransfer = Math.min(diskWriteBufferSize, 
dataRemaining);
-Platform.copyMemory(
-  recordPage, recordReadPosition, writeBuffer, 
Platform.BYTE_ARRAY_OFFSET, toTransfer);
-writer.write(writeBuffer, 0, toTransfer);
+if (bufferOffset > 0 && bufferOffset + toTransfer > 
DISK_WRITE_BUFFER_SIZE) {
--- End diff --

Yeah, I agree there' s no difference as for final result. But 
`writeMetrics` in `DiskBlockObjectWriter` would be incorrect during the 
process. Not only `numRecordsWritten`, but also `_bytesWritten`(this could only 
be correctly counted when `writer.write()` is called. You can see 
`recordWritten#updateBytesWritten` for detail).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22164: [SPARK-23679][YARN] Fix AmIpFilter cannot work in RM HA ...

2018-08-22 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/22164
  
Gently ping again @vanzin @tgravescs . Thanks!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21546: [SPARK-23030][SQL][PYTHON] Use Arrow stream format for c...

2018-08-22 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/21546
  
LGTM otherwise.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



  1   2   3   4   5   6   7   >