[GitHub] spark pull request #22540: [SPARK-24324] [PYTHON] [FOLLOW-UP] Rename the Con...

2018-09-24 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/22540#discussion_r220065589
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowUtils.scala 
---
@@ -131,11 +131,8 @@ object ArrowUtils {
 } else {
   Nil
 }
-val pandasColsByPosition = if 
(conf.pandasGroupedMapAssignColumnssByPosition) {
--- End diff --

This is fragile. The performance impact is big?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22540: [SPARK-24324] [PYTHON] [FOLLOW-UP] Rename the Con...

2018-09-24 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/22540#discussion_r220065308
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -1295,15 +1295,15 @@ object SQLConf {
   .booleanConf
   .createWithDefault(true)
 
-  val PANDAS_GROUPED_MAP_ASSIGN_COLUMNS_BY_POSITION =
-
buildConf("spark.sql.execution.pandas.groupedMap.assignColumnsByPosition")
+  val PANDAS_GROUPED_MAP_ASSIGN_COLUMNS_BY_NAME =
+
buildConf("spark.sql.legacy.execution.pandas.groupedMap.assignColumnsByName")
--- End diff --

Actually, we are not following this in the other SQLConf. I think either is 
fine. We will remove these SQLConf in Spark 3.0 release. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22198: [SPARK-25121][SQL] Supports multi-part table names for b...

2018-09-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22198
  
**[Test build #96536 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96536/testReport)**
 for PR 22198 at commit 
[`9a6da27`](https://github.com/apache/spark/commit/9a6da2750554cd03ce468d6bfe0aeba8b75883b8).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22198: [SPARK-25121][SQL] Supports multi-part table names for b...

2018-09-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22198
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22198: [SPARK-25121][SQL] Supports multi-part table names for b...

2018-09-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22198
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3422/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22198: [SPARK-25121][SQL] Supports multi-part table names for b...

2018-09-24 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22198
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22540: [SPARK-24324] [PYTHON] [FOLLOW-UP] Rename the Conf to sp...

2018-09-24 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22540
  
Change itself LGTM except that bool comparison one


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22466: [SPARK-25464][SQL]On dropping the Database it will drop ...

2018-09-24 Thread sandeep-katta
Github user sandeep-katta commented on the issue:

https://github.com/apache/spark/pull/22466
  
seems @cloud-fan  comments are valid as it will not result in any behavior 
change, I will update the PR accordingly WDYT @srowen 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22541: [SPARK-23907][SQL] Revert regr_* functions entirely

2018-09-24 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22541
  
LGTM, these functions have weird names and looks not very useful.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22507: [SPARK-25495][SS]FetchedData.reset should reset a...

2018-09-24 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/22507#discussion_r220061053
  
--- Diff: 
external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchSourceSuite.scala
 ---
@@ -874,6 +874,57 @@ abstract class KafkaMicroBatchSourceSuiteBase extends 
KafkaSourceSuiteBase {
   )
 }
   }
+
+  test("SPARK-25495: FetchedData.reset should reset all fields") {
+val topic = newTopic()
+val topicPartition = new TopicPartition(topic, 0)
+testUtils.createTopic(topic, partitions = 1)
+
+val ds = spark
+  .readStream
+  .format("kafka")
+  .option("kafka.bootstrap.servers", testUtils.brokerAddress)
+  .option("kafka.metadata.max.age.ms", "1")
+  .option("kafka.isolation.level", "read_committed")
+  .option("subscribe", topic)
+  .option("startingOffsets", "earliest")
+  .load()
+  .select($"value".as[String])
+
+testUtils.withTranscationalProducer { producer =>
+  producer.beginTransaction()
+  (0 to 3).foreach { i =>
+producer.send(new ProducerRecord[String, String](topic, 
i.toString)).get()
+  }
+  producer.commitTransaction()
+}
+testUtils.waitUntilOffsetAppears(topicPartition, 5)
+
+val q = ds.writeStream.foreachBatch { (ds, epochId) =>
+  if (epochId == 0) {
+// Post more messages to Kafka so that the executors will fetch 
messages in the next batch
+// and drop them. In this case, if we forget to reset 
`FetchedData._nextOffsetInFetchedData`
--- End diff --

Make it clear that you " want to send more message *before* the tasks of 
the current batch start reading the current batch data, so that the executors 


also, I am not entirely sure how it causes `fetchedData.reset()` thus 
creating the issue. Are you sure this fails without your fix?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22542: [SPARK-25519][SQL] ArrayRemove function may return incor...

2018-09-24 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22542
  
@cloud-fan Thanks a lot. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22484: [SPARK-25476][SPARK-25510][TEST] Refactor AggregateBench...

2018-09-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22484
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22484: [SPARK-25476][SPARK-25510][TEST] Refactor AggregateBench...

2018-09-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22484
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96531/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22484: [SPARK-25476][SPARK-25510][TEST] Refactor AggregateBench...

2018-09-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22484
  
**[Test build #96531 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96531/testReport)**
 for PR 22484 at commit 
[`d5fecdc`](https://github.com/apache/spark/commit/d5fecdcca267cb34c3d9aaa72d8ea4a3a7a3cc60).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22507: [SPARK-25495][SS]FetchedData.reset should reset a...

2018-09-24 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/22507#discussion_r220059919
  
--- Diff: 
external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchSourceSuite.scala
 ---
@@ -874,6 +874,57 @@ abstract class KafkaMicroBatchSourceSuiteBase extends 
KafkaSourceSuiteBase {
   )
 }
   }
+
+  test("SPARK-25495: FetchedData.reset should reset all fields") {
+val topic = newTopic()
+val topicPartition = new TopicPartition(topic, 0)
+testUtils.createTopic(topic, partitions = 1)
+
+val ds = spark
+  .readStream
+  .format("kafka")
+  .option("kafka.bootstrap.servers", testUtils.brokerAddress)
+  .option("kafka.metadata.max.age.ms", "1")
+  .option("kafka.isolation.level", "read_committed")
+  .option("subscribe", topic)
+  .option("startingOffsets", "earliest")
+  .load()
+  .select($"value".as[String])
+
+testUtils.withTranscationalProducer { producer =>
+  producer.beginTransaction()
+  (0 to 3).foreach { i =>
+producer.send(new ProducerRecord[String, String](topic, 
i.toString)).get()
+  }
+  producer.commitTransaction()
+}
+testUtils.waitUntilOffsetAppears(topicPartition, 5)
+
+val q = ds.writeStream.foreachBatch { (ds, epochId) =>
+  if (epochId == 0) {
+// Post more messages to Kafka so that the executors will fetch 
messages in the next batch
+// and drop them. In this case, if we forget to reset 
`FetchedData._nextOffsetInFetchedData`
--- End diff --

what do you mean by drop them?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22540: [SPARK-24324] [PYTHON] [FOLLOW-UP] Rename the Con...

2018-09-24 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22540#discussion_r220058070
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowUtils.scala 
---
@@ -131,11 +131,8 @@ object ArrowUtils {
 } else {
   Nil
 }
-val pandasColsByPosition = if 
(conf.pandasGroupedMapAssignColumnssByPosition) {
--- End diff --

Eh, also, I think @BryanCutler did this to send less data from JVM to 
Python side. Not a big deal but might better just leave as was.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22466: [SPARK-25464][SQL]On dropping the Database it will drop ...

2018-09-24 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22466
  
This is a behavior change and makes us different from Hive. However I can't 
find a strong reason to do it. It's like importing a database, but we can't 
automatically create table entries in the metastore when creating a database 
with an existing location.

To me a more reasonable behavior is, fail earlier when creating a database 
with an existing and non-empty location.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22466: [SPARK-25464][SQL]On dropping the Database it will drop ...

2018-09-24 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/22466
  
Owp, I've been misreading that several times. Right. Well by analogy, if a 
database has a non default LOCATION then so do it's tables, and they are 
treated like EXTERNAL tables. Dropping the DB means dropping the tables, and 
dropping those tables doesn't delete data. So should the same happen for DBs? 
Seems sensible, because the DB directory might not even be empty.

Still I feel like I'm missing something if it only comes up in the case 
that two DBs have the same location, which is going to cause a bunch of other 
problems. 

But is it the right change simply because it's consistent with how dropping 
tables works?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22541: [SPARK-23907][SQL] Revert regr_* functions entirely

2018-09-24 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22541
  
I'm supportive of this change.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22524: [SPARK-25497][SQL] Limit operation within whole s...

2018-09-24 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/22524#discussion_r220054697
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala ---
@@ -71,22 +71,14 @@ trait BaseLimitExec extends UnaryExecNode with 
CodegenSupport {
   }
 
   override def doConsume(ctx: CodegenContext, input: Seq[ExprCode], row: 
ExprCode): String = {
-val stopEarly =
-  ctx.addMutableState(CodeGenerator.JAVA_BOOLEAN, "stopEarly") // init 
as stopEarly = false
-
-ctx.addNewFunction("stopEarly", s"""
-  @Override
-  protected boolean stopEarly() {
-return $stopEarly;
-  }
-""", inlineToOuterClass = true)
 val countTerm = ctx.addMutableState(CodeGenerator.JAVA_INT, "count") 
// init as count = 0
 s"""
| if ($countTerm < $limit) {
|   $countTerm += 1;
+   |   if ($countTerm == $limit) {
+   | setStopEarly(true);
--- End diff --

Actually as I'm just looking at the query again, there should not be a 
`stopEarly` check inside `consume` that prevents us to consume the last record. 
Because the check should be at the outer while loop.

The cases having `stopEarly` check inside `consume`, is blocking operators 
like sort and aggregate, for them we need to reset the flag.

But for safety, I think I will also move this after `consume`.



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22540: [SPARK-24324] [PYTHON] [FOLLOW-UP] Rename the Con...

2018-09-24 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22540#discussion_r220054060
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -1295,15 +1295,15 @@ object SQLConf {
   .booleanConf
   .createWithDefault(true)
 
-  val PANDAS_GROUPED_MAP_ASSIGN_COLUMNS_BY_POSITION =
-
buildConf("spark.sql.execution.pandas.groupedMap.assignColumnsByPosition")
+  val PANDAS_GROUPED_MAP_ASSIGN_COLUMNS_BY_NAME =
+
buildConf("spark.sql.legacy.execution.pandas.groupedMap.assignColumnsByName")
--- End diff --

Out of curiosity, wouldn't we better leave this `false` by default (which 
means we disable a legacy behvaiour by default)?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22542: [SPARK-25519][SQL] ArrayRemove function may return incor...

2018-09-24 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22542
  
thanks, merging to master/2.4!

@dilipbiswal sorry I didn't see your comment while merging. If the problem 
is about "implicit casting between two Map types", feel free to open a PR and 
let's see if we can backport it. thanks!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22542: [SPARK-25519][SQL] ArrayRemove function may retur...

2018-09-24 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/22542


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22542: [SPARK-25519][SQL] ArrayRemove function may return incor...

2018-09-24 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22542
  
@cloud-fan @ueshin  I would like to ask a question here. There is one more 
function i wanted to fix. Originally i wanted to do it as part of this PR.. 
then realized that its not as straight forward and decided to do it in separate 
PR. The function in question is element_at. It turns out that we don't 
currently handle implicit casting between two Map types if the child types are 
different. say Map to Map. I was planning to enhance 
our implicitCast routine handle casting between MapTypes just like we do for 
ArrayType today. Does that sound okay to you ?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22466: [SPARK-25464][SQL]On dropping the Database it will drop ...

2018-09-24 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22466
  
> That link says Hive does support EXTERNAL. What am I missing?

Hive supports `EXTERNAL` only for tables, not databases.
The CREATE TABLE syntax:
```
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name ...
```
The CREATE DATABASE syntax:
```
CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name ...
```


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22484: [SPARK-25476][SPARK-25510][TEST] Refactor AggregateBench...

2018-09-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22484
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96529/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22484: [SPARK-25476][SPARK-25510][TEST] Refactor AggregateBench...

2018-09-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22484
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22511: [SPARK-25422][CORE] Don't memory map blocks streamed to ...

2018-09-24 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22511
  
a possible approach: can we just not dispose the data in `TorrentBroadcast`?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22484: [SPARK-25476][SPARK-25510][TEST] Refactor AggregateBench...

2018-09-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22484
  
**[Test build #96529 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96529/testReport)**
 for PR 22484 at commit 
[`536d4e9`](https://github.com/apache/spark/commit/536d4e9bb0cc737ee2f7c887763a4c9ebef147f5).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22484: [SPARK-25476][SPARK-25510][TEST] Refactor AggregateBench...

2018-09-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22484
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22484: [SPARK-25476][SPARK-25510][TEST] Refactor AggregateBench...

2018-09-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22484
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96530/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22484: [SPARK-25476][SPARK-25510][TEST] Refactor AggregateBench...

2018-09-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22484
  
**[Test build #96530 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96530/testReport)**
 for PR 22484 at commit 
[`f783099`](https://github.com/apache/spark/commit/f783099f818d3ed509ab95133ff46e2f183444a8).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22541: [SPARK-23907][SQL] Revert regr_* functions entirely

2018-09-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22541
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3421/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22541: [SPARK-23907][SQL] Revert regr_* functions entirely

2018-09-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22541
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22539: [SPARK-25517][SQL] Detect/Infer date type in CSV file

2018-09-24 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/22539
  
Thank you for review, @HyukjinKwon .

@softmanu . Could you take a look at 
[SPARK-19228](https://github.com/apache/spark/pull/21363) and close this PR and 
Apache Spark JIRA?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22511: [SPARK-25422][CORE] Don't memory map blocks streamed to ...

2018-09-24 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22511
  
The analysis makes sense to me. The thing I'm not sure is, how can we hit 
it? The "fetch block to temp file" code path is only enabled for big blocks (> 
2GB).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22541: [SPARK-23907][SQL] Revert regr_* functions entirely

2018-09-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22541
  
**[Test build #96535 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96535/testReport)**
 for PR 22541 at commit 
[`ef0f5b0`](https://github.com/apache/spark/commit/ef0f5b02dbce29d6fbaa6f79f9c2ad62e7a16bb0).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22541: [SPARK-23907][SQL] Revert regr_* functions entirely

2018-09-24 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/22541
  
Retest this please.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22521: [SPARK-24519] Compute SHUFFLE_MIN_NUM_PARTS_TO_HIGHLY_CO...

2018-09-24 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/22521
  
nit. Could you add `[CORE]` to the PR title, too?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22521: [SPARK-24519] Compute SHUFFLE_MIN_NUM_PARTS_TO_HIGHLY_CO...

2018-09-24 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/22521
  
yup; just did


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22521: [SPARK-24519] Compute SHUFFLE_MIN_NUM_PARTS_TO_HIGHLY_CO...

2018-09-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22521
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3420/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22521: [SPARK-24519] Compute SHUFFLE_MIN_NUM_PARTS_TO_HIGHLY_CO...

2018-09-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22521
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22521: [SPARK-24519] Compute SHUFFLE_MIN_NUM_PARTS_TO_HIGHLY_CO...

2018-09-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22521
  
**[Test build #96534 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96534/testReport)**
 for PR 22521 at commit 
[`4af98e7`](https://github.com/apache/spark/commit/4af98e76319cbb363b5646f3cde85a3eca12a6ef).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22521: [SPARK-24519] Compute SHUFFLE_MIN_NUM_PARTS_TO_HIGHLY_CO...

2018-09-24 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/22521
  
@rxin . Could you fill the PR description, too?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22521: [SPARK-24519] Compute SHUFFLE_MIN_NUM_PARTS_TO_HIGHLY_CO...

2018-09-24 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/22521
  
Retest this please.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22539: [SPARK-25517][SQL] Detect/Infer date type in CSV file

2018-09-24 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22539
  
I think this is a duplicate of https://github.com/apache/spark/pull/21363


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22537: [SPARK-21291][R] add R partitionBy API in DataFra...

2018-09-24 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22537#discussion_r220050102
  
--- Diff: R/pkg/tests/fulltests/test_sparkSQL.R ---
@@ -2713,8 +2713,16 @@ test_that("read/write text files", {
   expect_equal(colnames(df2), c("value"))
   expect_equal(count(df2), count(df) * 2)
 
+  df3 <- createDataFrame(list(list(1L, "1"), list(2L, "2"), list(1L, "1"), 
list(2L, "2")),
+schema = c("key", "value"))
--- End diff --

one more space


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22537: [SPARK-21291][R] add R partitionBy API in DataFra...

2018-09-24 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22537#discussion_r220050067
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -2985,8 +2988,18 @@ setMethod("write.df",
 if (is.null(source)) {
   source <- getDefaultSqlSource()
 }
+cols <- NULL
+if (!is.null(partitionBy)) {
+  if (!all(sapply(partitionBy, function(c) { is.character(c) 
}))) {
--- End diff --

not a big deal but `function(c) { is.character(c) }` -> `function(c) 
is.character(c)` if more changes should be pushed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22466: [SPARK-25464][SQL]On dropping the Database it will drop ...

2018-09-24 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/22466
  
That link says Hive does support EXTERNAL. What am I missing? Well, in any 
event we aren't contemplating a behavior change here. 

If you delete a table with LOCATION specified, what should happen? Hive 
would delete it I guess... unless it's EXTERNAL. Looks like Spark would delete 
it even when it's in an 'external' location. 

If these are conflated I guess we weigh the surprise in not deleting 
'local' locations for data when a table is dropped, vs surprise in deleting 
'external' locations for data.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22484: [SPARK-25476][SPARK-25510][TEST] Refactor AggregateBench...

2018-09-24 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/22484
  
BTW, I don't think we need `withTempTable` and `withTempPath` in this PR. 
Those are beyond of the scope of this PR.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22542: [SPARK-25519][SQL] ArrayRemove function may return incor...

2018-09-24 Thread ueshin
Github user ueshin commented on the issue:

https://github.com/apache/spark/pull/22542
  
LGTM.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22484: [SPARK-25476][SPARK-25510][TEST] Refactor AggregateBench...

2018-09-24 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/22484
  
Yep. This PR already is introducing new `trait` for `Benchmark` as a part 
of grand refactoring plan. I think the other new `trait`s also possible at this 
time. We will see. :)


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22524: [SPARK-25497][SQL] Limit operation within whole stage co...

2018-09-24 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/22524
  
> It will be great to explain how limit works in whole stage codegen, in 
general. This part is a little hard to understand and I believe many operators 
need to deal with limit as well.

Ok. Let me add more explanation into the PR description.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22494: [SPARK-25454][SQL] add a new config for picking minimum ...

2018-09-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22494
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22524: [SPARK-25497][SQL] Limit operation within whole s...

2018-09-24 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/22524#discussion_r220048264
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala ---
@@ -71,22 +71,14 @@ trait BaseLimitExec extends UnaryExecNode with 
CodegenSupport {
   }
 
   override def doConsume(ctx: CodegenContext, input: Seq[ExprCode], row: 
ExprCode): String = {
-val stopEarly =
-  ctx.addMutableState(CodeGenerator.JAVA_BOOLEAN, "stopEarly") // init 
as stopEarly = false
-
-ctx.addNewFunction("stopEarly", s"""
-  @Override
-  protected boolean stopEarly() {
-return $stopEarly;
-  }
-""", inlineToOuterClass = true)
 val countTerm = ctx.addMutableState(CodeGenerator.JAVA_INT, "count") 
// init as count = 0
 s"""
| if ($countTerm < $limit) {
|   $countTerm += 1;
+   |   if ($countTerm == $limit) {
+   | setStopEarly(true);
--- End diff --

Oh, I see. And I think `shouldStop` shouldn't be called inside `consume`.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22494: [SPARK-25454][SQL] add a new config for picking minimum ...

2018-09-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22494
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3419/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22494: [SPARK-25454][SQL] add a new config for picking minimum ...

2018-09-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22494
  
**[Test build #96533 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96533/testReport)**
 for PR 22494 at commit 
[`faa0bd0`](https://github.com/apache/spark/commit/faa0bd065114ea7f0f9836e36ac40cd311e1983f).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22525: [SPARK-25503][CORE][WEBUI]Total task message in s...

2018-09-24 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/22525


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22484: [SPARK-25476][SPARK-25510][TEST] Refactor AggregateBench...

2018-09-24 Thread wangyum
Github user wangyum commented on the issue:

https://github.com/apache/spark/pull/22484
  
Thanks @dongjoon-hyun. except `withSQLConf`. We need `withTempTable` and 
`withTempPath`:

https://github.com/apache/spark/blob/d25f425c9652a3611dd5fea8a37df4abb13e126e/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala#L63-L83


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22494: [SPARK-25454][SQL] add a new config for picking minimum ...

2018-09-24 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22494
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22524: [SPARK-25497][SQL] Limit operation within whole stage co...

2018-09-24 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22524
  
It will be great to explain how limit works in whole stage codegen, in 
general. This part is a little hard to understand.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22525: [SPARK-25503][CORE][WEBUI]Total task message in stage pa...

2018-09-24 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/22525
  
Merged to master/branch-2.4/branch-2.3.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22484: [SPARK-25476][SPARK-25510][TEST] Refactor AggregateBench...

2018-09-24 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/22484
  
Also, cc @gatorsmile . Since we need `withSQLConf` in both Benchmark and 
TestSuite, I want to introduce `SupportWithSQLConf` trait .


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22524: [SPARK-25497][SQL] Limit operation within whole s...

2018-09-24 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22524#discussion_r220046724
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala ---
@@ -71,22 +71,14 @@ trait BaseLimitExec extends UnaryExecNode with 
CodegenSupport {
   }
 
   override def doConsume(ctx: CodegenContext, input: Seq[ExprCode], row: 
ExprCode): String = {
-val stopEarly =
-  ctx.addMutableState(CodeGenerator.JAVA_BOOLEAN, "stopEarly") // init 
as stopEarly = false
-
-ctx.addNewFunction("stopEarly", s"""
-  @Override
-  protected boolean stopEarly() {
-return $stopEarly;
-  }
-""", inlineToOuterClass = true)
 val countTerm = ctx.addMutableState(CodeGenerator.JAVA_INT, "count") 
// init as count = 0
 s"""
| if ($countTerm < $limit) {
|   $countTerm += 1;
+   |   if ($countTerm == $limit) {
+   | setStopEarly(true);
--- End diff --

`if ($countTerm == $limit)` means this is the last record, and we should 
still consume it?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22524: [SPARK-25497][SQL] Limit operation within whole s...

2018-09-24 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/22524#discussion_r220046213
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/execution/BufferedRowIterator.java 
---
@@ -73,14 +78,21 @@ public void append(InternalRow row) {
 currentRows.add(row);
   }
 
+  /**
+   * Sets the flag of stopping the query execution early.
+   */
+  public void setStopEarly(boolean value) {
--- End diff --

You also hint me that we should reset stop early flag in sort exec node 
too. I will add it and related test.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22484: [SPARK-25476][SPARK-25510][TEST] Refactor AggregateBench...

2018-09-24 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/22484
  
@wangyum . If you don't mind, could you review my PR to your branch, 
https://github.com/wangyum/spark/pull/11, which
- deduplicates `withSQLConf` back
- runs on AWS EC2 `r3.xlarge`?



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22524: [SPARK-25497][SQL] Limit operation within whole s...

2018-09-24 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/22524#discussion_r220046092
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/execution/BufferedRowIterator.java 
---
@@ -73,14 +78,21 @@ public void append(InternalRow row) {
 currentRows.add(row);
   }
 
+  /**
+   * Sets the flag of stopping the query execution early.
+   */
+  public void setStopEarly(boolean value) {
--- End diff --

Ok. Let me add it.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22429: [SPARK-25440][SQL] Dumping query execution info to a fil...

2018-09-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22429
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22429: [SPARK-25440][SQL] Dumping query execution info to a fil...

2018-09-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22429
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96527/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22429: [SPARK-25440][SQL] Dumping query execution info to a fil...

2018-09-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22429
  
**[Test build #96527 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96527/testReport)**
 for PR 22429 at commit 
[`4ec5732`](https://github.com/apache/spark/commit/4ec57326612adb2d650e425d03eae70366480242).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22524: [SPARK-25497][SQL] Limit operation within whole s...

2018-09-24 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/22524#discussion_r220044740
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala ---
@@ -71,22 +71,14 @@ trait BaseLimitExec extends UnaryExecNode with 
CodegenSupport {
   }
 
   override def doConsume(ctx: CodegenContext, input: Seq[ExprCode], row: 
ExprCode): String = {
-val stopEarly =
-  ctx.addMutableState(CodeGenerator.JAVA_BOOLEAN, "stopEarly") // init 
as stopEarly = false
-
-ctx.addNewFunction("stopEarly", s"""
-  @Override
-  protected boolean stopEarly() {
-return $stopEarly;
-  }
-""", inlineToOuterClass = true)
 val countTerm = ctx.addMutableState(CodeGenerator.JAVA_INT, "count") 
// init as count = 0
 s"""
| if ($countTerm < $limit) {
|   $countTerm += 1;
+   |   if ($countTerm == $limit) {
+   | setStopEarly(true);
--- End diff --

won't we call `shouldStop` inside `consume`? if it does, `stopEarly` will 
not be set.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22542: [SPARK-25519][SQL] ArrayRemove function may return incor...

2018-09-24 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22542
  
LGTM. The PR description has a typo: `ArrayPosition` => `ArrayRemove` 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22524: [SPARK-25497][SQL] Limit operation within whole s...

2018-09-24 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/22524#discussion_r220044584
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala
 ---
@@ -465,13 +465,18 @@ case class RangeExec(range: 
org.apache.spark.sql.catalyst.plans.logical.Range)
   |   $initRangeFuncName(partitionIndex);
   | }
   |
-  | while (true) {
+  | while (true && !stopEarly()) {
   |   long $range = $batchEnd - $number;
   |   if ($range != 0L) {
   | int $localEnd = (int)($range / ${step}L);
   | for (int $localIdx = 0; $localIdx < $localEnd; $localIdx++) {
   |   long $value = ((long)$localIdx * ${step}L) + $number;
+  |   $numOutput.add(1);
--- End diff --

ok. then I should revert the `numOutput` change if the number of records 
can be a bit inaccurate.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22511: [SPARK-25422][CORE] Don't memory map blocks streamed to ...

2018-09-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22511
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22511: [SPARK-25422][CORE] Don't memory map blocks streamed to ...

2018-09-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22511
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96526/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22511: [SPARK-25422][CORE] Don't memory map blocks streamed to ...

2018-09-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22511
  
**[Test build #96526 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96526/testReport)**
 for PR 22511 at commit 
[`aee82ab`](https://github.com/apache/spark/commit/aee82abe4cd9fbefa14fb280644276fe491bcf9a).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22524: [SPARK-25497][SQL] Limit operation within whole s...

2018-09-24 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22524#discussion_r220044271
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala ---
@@ -71,22 +71,14 @@ trait BaseLimitExec extends UnaryExecNode with 
CodegenSupport {
   }
 
   override def doConsume(ctx: CodegenContext, input: Seq[ExprCode], row: 
ExprCode): String = {
-val stopEarly =
-  ctx.addMutableState(CodeGenerator.JAVA_BOOLEAN, "stopEarly") // init 
as stopEarly = false
-
-ctx.addNewFunction("stopEarly", s"""
-  @Override
-  protected boolean stopEarly() {
-return $stopEarly;
-  }
-""", inlineToOuterClass = true)
 val countTerm = ctx.addMutableState(CodeGenerator.JAVA_INT, "count") 
// init as count = 0
 s"""
| if ($countTerm < $limit) {
|   $countTerm += 1;
+   |   if ($countTerm == $limit) {
+   | setStopEarly(true);
--- End diff --

shall we do this after `consume`?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22524: [SPARK-25497][SQL] Limit operation within whole s...

2018-09-24 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22524#discussion_r220044149
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala
 ---
@@ -465,13 +465,18 @@ case class RangeExec(range: 
org.apache.spark.sql.catalyst.plans.logical.Range)
   |   $initRangeFuncName(partitionIndex);
   | }
   |
-  | while (true) {
+  | while (true && !stopEarly()) {
   |   long $range = $batchEnd - $number;
   |   if ($range != 0L) {
   | int $localEnd = (int)($range / ${step}L);
   | for (int $localIdx = 0; $localIdx < $localEnd; $localIdx++) {
   |   long $value = ((long)$localIdx * ${step}L) + $number;
+  |   $numOutput.add(1);
--- End diff --

This is very likely to hit perf regression since it's not a tight loop 
anymore.

We want the range operator to stop earlier for better performance, but it 
doesn't mean the range operator must return exactly the `limit` number of 
records. Since the range operator is already returning data in batch, I think 
we can stop earlier in a batch granularity.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22542: [SPARK-25519][SQL] ArrayRemove function may return incor...

2018-09-24 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22542
  
cc @ueshin @cloud-fan 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22525: [SPARK-25503][CORE][WEBUI]Total task message in stage pa...

2018-09-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22525
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96523/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22525: [SPARK-25503][CORE][WEBUI]Total task message in stage pa...

2018-09-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22525
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22525: [SPARK-25503][CORE][WEBUI]Total task message in stage pa...

2018-09-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22525
  
**[Test build #96523 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96523/testReport)**
 for PR 22525 at commit 
[`b224a31`](https://github.com/apache/spark/commit/b224a31718b3a5c6ce8754322b1e7d2dd364ac14).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22484: [SPARK-25476][SPARK-25510][TEST] Refactor Aggrega...

2018-09-24 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/22484#discussion_r220043531
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/SqlBasedBenchmark.scala
 ---
@@ -0,0 +1,87 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.benchmark
+
+import org.apache.spark.benchmark.{Benchmark, BenchmarkBase}
+import org.apache.spark.sql.{AnalysisException, SparkSession}
+import org.apache.spark.sql.internal.SQLConf
+
+/**
+ * Common base trait to run benchmark with the Dataset and DataFrame API.
+ */
+trait SqlBasedBenchmark extends BenchmarkBase {
+
+  val spark: SparkSession = getSparkSession
+
+  /** Subclass can override this function to build their own SparkSession 
*/
+  def getSparkSession: SparkSession = {
+SparkSession.builder()
+  .master("local[1]")
+  .appName(this.getClass.getCanonicalName)
+  .config(SQLConf.SHUFFLE_PARTITIONS.key, 1)
+  .config(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key, 1)
+  .getOrCreate()
+  }
+
+  /**
+   * Sets all SQL configurations specified in `pairs`, calls `f`, and then 
restores all SQL
+   * configurations.
+   */
+  protected def withSQLConf(pairs: (String, String)*)(f: => Unit): Unit = {
--- End diff --

Shall we avoid duplicating the existing logic `withSQLConf`? Let me try to 
fix.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22524: [SPARK-25497][SQL] Limit operation within whole s...

2018-09-24 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22524#discussion_r220043421
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/execution/BufferedRowIterator.java 
---
@@ -73,14 +78,21 @@ public void append(InternalRow row) {
 currentRows.add(row);
   }
 
+  /**
+   * Sets the flag of stopping the query execution early.
+   */
+  public void setStopEarly(boolean value) {
--- End diff --

can we have more documents about how to use it? For now I see 2 use cases:
1. limit operator should call it with `true` when the limit is hit
2. blocking operator(sort, agg, etc.) should call it with `false` to reset 
it.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22466: [SPARK-25464][SQL]On dropping the Database it will drop ...

2018-09-24 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22466
  
yea, in Spark we conflate the two and treat a table as external if location 
is specified.

However, Hive doesn't have external database, see: 
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Create/Drop/Alter/UseDatabase

I don't want to introduce unnecessary behavior difference from hive, and I 
feel it's not very useful to have external database. Although your table files 
can be in an existing folder, but LIST TABLES will not work.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22453: [SPARK-20937][DOCS] Describe spark.sql.parquet.writeLega...

2018-09-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22453
  
**[Test build #96532 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96532/testReport)**
 for PR 22453 at commit 
[`6e2680f`](https://github.com/apache/spark/commit/6e2680f3a53cd617c85f424941e13ff47d98aea9).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22453: [SPARK-20937][DOCS] Describe spark.sql.parquet.wr...

2018-09-24 Thread seancxmao
Github user seancxmao commented on a diff in the pull request:

https://github.com/apache/spark/pull/22453#discussion_r220042478
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1002,6 +1002,21 @@ Configuration of Parquet can be done using the 
`setConf` method on `SparkSession
 
   
 
+
+  spark.sql.parquet.writeLegacyFormat
+  false
+  
+This configuration indicates whether we should use legacy Parquet 
format adopted by Spark 1.4
+and prior versions or the standard format defined in parquet-format 
specification to write
+Parquet files. This is not only related to compatibility with old 
Spark ones, but also other
+systems like Hive, Impala, Presto, etc. This is especially important 
for decimals. If this
+configuration is not enabled, decimals will be written in int-based 
format in Spark 1.5 and
+above, other systems that only support legacy decimal format (fixed 
length byte array) will not
+be able to read what Spark has written. Note other systems may have 
added support for the
+standard format in more recent versions, which will make this 
configuration unnecessary. Please
--- End diff --

Thanks for your suggestion. I have updated the doc in SQLConf.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22535: [SPARK-17636][SQL][WIP] Parquet predicate pushdown in ne...

2018-09-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22535
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96528/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22535: [SPARK-17636][SQL][WIP] Parquet predicate pushdown in ne...

2018-09-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22535
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22535: [SPARK-17636][SQL][WIP] Parquet predicate pushdown in ne...

2018-09-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22535
  
**[Test build #96528 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96528/testReport)**
 for PR 22535 at commit 
[`c95706f`](https://github.com/apache/spark/commit/c95706f60e4d576caca78a32000d4a7bbb12c141).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22524: [SPARK-25497][SQL] Limit operation within whole s...

2018-09-24 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/22524#discussion_r220040370
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/execution/BufferedRowIterator.java 
---
@@ -38,6 +38,11 @@
 
   protected int partitionIndex = -1;
 
+  // This indicates whether the query execution should be stopped even the 
input rows are still
+  // available. This is used in limit operator. When it reaches the given 
number of rows to limit,
+  // this flag is set and the execution should be stopped.
+  protected boolean isStopEarly = false;
--- End diff --

I've added a test for 2 limits.

When any of 2 limits sets `isStopEarly`, I think the execution should be 
stopped. Is there any case opposite to this?




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22542: [SPARK-25519][SQL] ArrayRemove function may return incor...

2018-09-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22542
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96525/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22542: [SPARK-25519][SQL] ArrayRemove function may return incor...

2018-09-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22542
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22542: [SPARK-25519][SQL] ArrayRemove function may return incor...

2018-09-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22542
  
**[Test build #96525 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96525/testReport)**
 for PR 22542 at commit 
[`ac9cb1a`](https://github.com/apache/spark/commit/ac9cb1a2d2e5b73f6c6e89c510f84a1f10efb60e).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22316: [SPARK-25048][SQL] Pivoting by multiple columns in Scala...

2018-09-24 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22316
  
One safe change is to not use the `lit` function, but to do a manual 
pattern match and still use `Literal.apply`. We can investigate 
`Literal.create` in a followup


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22524: [SPARK-25497][SQL] Limit operation within whole s...

2018-09-24 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22524#discussion_r220039084
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/execution/BufferedRowIterator.java 
---
@@ -38,6 +38,11 @@
 
   protected int partitionIndex = -1;
 
+  // This indicates whether the query execution should be stopped even the 
input rows are still
+  // available. This is used in limit operator. When it reaches the given 
number of rows to limit,
+  // this flag is set and the execution should be stopped.
+  protected boolean isStopEarly = false;
--- End diff --

what if there are 2 limits in the query? 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22325: [SPARK-25318]. Add exception handling when wrapping the ...

2018-09-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22325
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96522/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22325: [SPARK-25318]. Add exception handling when wrapping the ...

2018-09-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22325
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22325: [SPARK-25318]. Add exception handling when wrapping the ...

2018-09-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22325
  
**[Test build #96522 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96522/testReport)**
 for PR 22325 at commit 
[`296f65b`](https://github.com/apache/spark/commit/296f65bea18f9ae436d8c34ce43fe8c12b46e834).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22453: [SPARK-20937][DOCS] Describe spark.sql.parquet.wr...

2018-09-24 Thread seancxmao
Github user seancxmao commented on a diff in the pull request:

https://github.com/apache/spark/pull/22453#discussion_r220038438
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1002,6 +1002,21 @@ Configuration of Parquet can be done using the 
`setConf` method on `SparkSession
 
   
 
+
+  spark.sql.parquet.writeLegacyFormat
+  false
+  
+This configuration indicates whether we should use legacy Parquet 
format adopted by Spark 1.4
+and prior versions or the standard format defined in parquet-format 
specification to write
+Parquet files. This is not only related to compatibility with old 
Spark ones, but also other
+systems like Hive, Impala, Presto, etc. This is especially important 
for decimals. If this
+configuration is not enabled, decimals will be written in int-based 
format in Spark 1.5 and
+above, other systems that only support legacy decimal format (fixed 
length byte array) will not
+be able to read what Spark has written. Note other systems may have 
added support for the
+standard format in more recent versions, which will make this 
configuration unnecessary. Please
--- End diff --

Hive and Impala do NOT support new Parquet format yet.

* [HIVE-19069](https://jira.apache.org/jira/browse/HIVE-19069): Hive can't 
read int32 and int64 Parquet decimal. This issue is not resolved yet. This is 
consistent with source code check by @HyukjinKwon 
* [IMPALA-5542](https://issues.apache.org/jira/browse/IMPALA-5542): Impala 
cannot scan Parquet decimal stored as int64_t/int32_t. This is resolved, 
however targeted to Impala 3.1.0, which is a version not released yet. The 
latest release is 3.0.0 (https://impala.apache.org/downloads.html).

Presto began to support new Parquet format since 0.182.

* [issues/7533](https://github.com/prestodb/presto/issues/7533): Improve 
decimal type support in the new Parquet reader.  This patch is included in 
[0.182](https://prestodb.io/docs/current/release/release-0.182.html). Blow is 
the excerpt: 

> Fix reading decimal values in the optimized Parquet reader when they are 
backed by the int32 or int64 types.



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



  1   2   3   4   5   >