date:20190806

[GitHub] [spark] AmplabJenkins removed a comment on issue #24761: [SPARK-27905] [SQL] Add higher order function 'forall'

2019-08-06 Thread GitBox

AmplabJenkins removed a comment on issue #24761: [SPARK-27905] [SQL] Add higher 
order function 'forall'
URL: https://github.com/apache/spark/pull/24761#issuecomment-518839775
 
 
   Merged build finished. Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on issue #25309: [SPARK-28577][YARN]Resource capability requested for each executor add offHeapMemorySize

2019-08-06 Thread GitBox

AmplabJenkins commented on issue #25309: [SPARK-28577][YARN]Resource capability 
requested for each executor add offHeapMemorySize 
URL: https://github.com/apache/spark/pull/25309#issuecomment-518841025
 
 
   Can one of the admins verify this patch?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] vanzin commented on a change in pull request #25341: [SPARK-28607][CORE][SHUFFLE] Don't store partition lengths twice.

2019-08-06 Thread GitBox

vanzin commented on a change in pull request #25341: 
[SPARK-28607][CORE][SHUFFLE] Don't store partition lengths twice.
URL: https://github.com/apache/spark/pull/25341#discussion_r311271811
 
 

 ##
 File path: 
core/src/main/java/org/apache/spark/shuffle/api/ShufflePartitionWriter.java
 ##
 @@ -85,14 +85,4 @@
   default Optional openChannelWrapper() throws 
IOException {
 return Optional.empty();
   }
-
-  /**
-   * Returns the number of bytes written either by this writer's output stream 
opened by
-   * {@link #openStream()} or the byte channel opened by {@link 
#openChannelWrapper()}.
-   * 
-   * This can be different from the number of bytes given by the caller. For 
example, the
-   * stream might compress or encrypt the bytes before persisting the data to 
the backing
-   * data store.
-   */
-  long getNumBytesWritten();
 
 Review comment:
   Could you update the metrics after the commit (by adding up all the 
partition lengths)?
   
   Otherwise it doesn't seem horrible to keep this in the API.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] dongjoon-hyun commented on issue #25366: [SPARK-27918][SQL][TEST][FOLLOW-UP] Open comment about boolean test.

2019-08-06 Thread GitBox

dongjoon-hyun commented on issue #25366: [SPARK-27918][SQL][TEST][FOLLOW-UP] 
Open comment about boolean test.
URL: https://github.com/apache/spark/pull/25366#issuecomment-518842348
 
 
   @beliefer . Please regenerate the result. That will recover the UT failures.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] dongjoon-hyun commented on issue #25366: [SPARK-27924][SQL][TEST][FOLLOW-UP] Open comment about boolean test

2019-08-06 Thread GitBox

dongjoon-hyun commented on issue #25366: [SPARK-27924][SQL][TEST][FOLLOW-UP] 
Open comment about boolean test
URL: https://github.com/apache/spark/pull/25366#issuecomment-518844007
 
 
   BTW, @beliefer . This had better be a followup of yours ([SPARK-27924][SQL] 
Support ANSI SQL Boolean-Predicate syntax). I updated the PR title. Also, I 
changed `How was this patch tested?` section. Since you updated the test cases 
which are verified by Jenkins. It should not be 'N/A'.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] ueshin commented on issue #24761: [SPARK-27905] [SQL] Add higher order function 'forall'

2019-08-06 Thread GitBox

ueshin commented on issue #24761: [SPARK-27905] [SQL] Add higher order function 
'forall'
URL: https://github.com/apache/spark/pull/24761#issuecomment-518850060
 
 
   Thanks! merging to master.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] vanzin commented on a change in pull request #25369: [SPARK-28638][WebUI] Task summary metrics are wrong when there are running tasks

2019-08-06 Thread GitBox

vanzin commented on a change in pull request #25369: [SPARK-28638][WebUI] Task 
summary metrics are wrong when there are running tasks
URL: https://github.com/apache/spark/pull/25369#discussion_r311282837
 
 

 ##
 File path: core/src/main/scala/org/apache/spark/status/AppStatusStore.scala
 ##
 @@ -156,7 +156,8 @@ private[spark] class AppStatusStore(
 // cheaper for disk stores (avoids deserialization).
 val count = {
   Utils.tryWithResource(
-if (store.isInstanceOf[InMemoryStore]) {
+if (store.isInstanceOf[ElementTrackingStore]) {
 
 Review comment:
   I think a new method `private def isLiveUI: Boolean` would make the intent 
here clearer. It can be this instance check or just checking whether `listener` 
is defined.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] ueshin closed pull request #24761: [SPARK-27905] [SQL] Add higher order function 'forall'

2019-08-06 Thread GitBox

ueshin closed pull request #24761: [SPARK-27905] [SQL] Add higher order 
function 'forall'
URL: https://github.com/apache/spark/pull/24761
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on issue #13627: [SPARK-15906][MLlib] Add complementary naive bayes algorithm

2019-08-06 Thread GitBox

AmplabJenkins removed a comment on issue #13627: [SPARK-15906][MLlib] Add 
complementary naive bayes algorithm
URL: https://github.com/apache/spark/pull/13627#issuecomment-406583815
 
 
   Can one of the admins verify this patch?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on issue #13627: [SPARK-15906][MLlib] Add complementary naive bayes algorithm

2019-08-06 Thread GitBox

AmplabJenkins commented on issue #13627: [SPARK-15906][MLlib] Add complementary 
naive bayes algorithm
URL: https://github.com/apache/spark/pull/13627#issuecomment-518854134
 
 
   Can one of the admins verify this patch?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] huaxingao opened a new pull request #25371: [SPARK-28393][SQL][PYTHON][TESTS] Convert and port 'pgSQL/join.sql' into UDF test base

2019-08-06 Thread GitBox

huaxingao opened a new pull request #25371: [SPARK-28393][SQL][PYTHON][TESTS] 
Convert and port 'pgSQL/join.sql' into UDF test base
URL: https://github.com/apache/spark/pull/25371
 
 
   
   
   ## What changes were proposed in this pull request?
   
   This PR adds some tests converted from ```pgSQL/join.sql``` to test UDFs. 
Please see contribution guide of this umbrella ticket - 
[SPARK-27921](https://issues.apache.org/jira/browse/SPARK-27921).
   Diff comparing to 'join.sql'
   
   
   ```diff
   diff --git 
a/sql/core/src/test/resources/sql-tests/results/pgSQL/join.sql.out 
b/sql/core/src/test/resources/sql-tests/results/udf/pgSQL/udf-join.sql.out
   index 0730066719..4e8ab9d71a 100644
   --- a/sql/core/src/test/resources/sql-tests/results/pgSQL/join.sql.out
   +++ 
b/sql/core/src/test/resources/sql-tests/results/udf/pgSQL/udf-join.sql.out
   @@ -240,10 +240,10 @@ struct<>


-- !query 27
   -SELECT '' AS `xxx`, *
   +SELECT udf('') AS `xxx`, udf(i), udf(j), udf(t)
  FROM J1_TBL AS tx
-- !query 27 schema
   -struct
   +struct
-- !query 27 output
   0   NULLzero
   1   4   one
   @@ -259,10 +259,10 @@ struct


-- !query 28
   -SELECT '' AS `xxx`, *
   +SELECT udf('') AS `xxx`, udf(i), udf(j), udf(t)
  FROM J1_TBL tx
-- !query 28 schema
   -struct
   +struct
-- !query 28 output
   0   NULLzero
   1   4   one
   @@ -278,10 +278,10 @@ struct


-- !query 29
   -SELECT '' AS `xxx`, *
   +SELECT udf('') AS `xxx`, udf(a), udf(b), udf(c)
  FROM J1_TBL AS t1 (a, b, c)
-- !query 29 schema
   -struct
   +struct
-- !query 29 output
   0   NULLzero
   1   4   one
   @@ -297,10 +297,10 @@ struct


-- !query 30
   -SELECT '' AS `xxx`, *
   +SELECT udf('') AS `xxx`, udf(a), udf(b), udf(c)
  FROM J1_TBL t1 (a, b, c)
-- !query 30 schema
   -struct
   +struct
-- !query 30 output
   0   NULLzero
   1   4   one
   @@ -316,10 +316,10 @@ struct


-- !query 31
   -SELECT '' AS `xxx`, *
   +SELECT udf('') AS `xxx`, udf(a), udf(b), udf(c), udf(d), udf(e)
  FROM J1_TBL t1 (a, b, c), J2_TBL t2 (d, e)
-- !query 31 schema
   -struct
   +struct
-- !query 31 output
   0   NULLzero0   NULL
   0   NULLzero1   -1
   @@ -423,7 +423,7 @@ struct


-- !query 32
   -SELECT '' AS `xxx`, *
   +SELECT udf('') AS `xxx`, *
  FROM J1_TBL CROSS JOIN J2_TBL
-- !query 32 schema
struct
   @@ -530,20 +530,20 @@ struct


-- !query 33
   -SELECT '' AS `xxx`, i, k, t
   +SELECT udf('') AS `xxx`, udf(i), udf(k), udf(t)
  FROM J1_TBL CROSS JOIN J2_TBL
-- !query 33 schema
struct<>
-- !query 33 output
org.apache.spark.sql.AnalysisException
   -Reference 'i' is ambiguous, could be: default.j1_tbl.i, default.j2_tbl.i.; 
line 1 pos 20
   +Reference 'i' is ambiguous, could be: default.j1_tbl.i, default.j2_tbl.i.; 
line 1 pos 29


-- !query 34
   -SELECT '' AS `xxx`, t1.i, k, t
   +SELECT udf('') AS `xxx`, udf(t1.i), udf(k), udf(t)
  FROM J1_TBL t1 CROSS JOIN J2_TBL t2
-- !query 34 schema
   -struct
   +struct
-- !query 34 output
   0   -1  zero
   0   -3  zero
   @@ -647,11 +647,11 @@ struct


-- !query 35
   -SELECT '' AS `xxx`, ii, tt, kk
   +SELECT udf('') AS `xxx`, udf(ii), udf(tt), udf(kk)
  FROM (J1_TBL CROSS JOIN J2_TBL)
AS tx (ii, jj, tt, ii2, kk)
-- !query 35 schema
   -struct
   +struct
-- !query 35 output
   0   zero-1
   0   zero-3
   @@ -755,10 +755,10 @@ struct


-- !query 36
   -SELECT '' AS `xxx`, *
   +SELECT udf('') AS `xxx`, udf(j1_tbl.i), udf(j), udf(t), udf(a.i), udf(a.k), 
udf(b.i),  udf(b.k)
  FROM J1_TBL CROSS JOIN J2_TBL a CROSS JOIN J2_TBL b
-- !query 36 schema
   -struct
   +struct
-- !query 36 output
   0   NULLzero0   NULL0   NULL
   0   NULLzero0   NULL1   -1
   @@ -1654,10 +1654,10 @@ 
struct


-- !query 37
   -SELECT '' AS `xxx`, *
   +SELECT udf('') AS `xxx`, udf(i), udf(j), udf(t), udf(k)
  FROM J1_TBL INNER JOIN J2_TBL USING (i)
-- !query 37 schema
   -struct
   +struct
-- !query 37 output
   0   NULLzeroNULL
   1   4   one -1
   @@ -1669,10 +1669,10 @@ struct


-- !query 38
   -SELECT '' AS `xxx`, *
   +SELECT udf('') AS `xxx`, udf(i), udf(j), udf(t), udf(k)
  FROM J1_TBL JOIN J2_TBL USING (i)
-- !query 38 schema
   -struct
   +struct
-- !query 38 output
   0   NULLzeroNULL
   1   4   one -1
   @@ -1684,9 +1684,9 @@ struct


-- !query 39
   -SELECT '' AS `xxx`, *
   +SELECT udf('') AS `xxx`, *
  FROM J1_TBL t1 (a, b, c) JOIN J2_TBL t2 (a,

[GitHub] [spark] huaxingao commented on issue #25371: [SPARK-28393][SQL][PYTHON][TESTS] Convert and port 'pgSQL/join.sql' into UDF test base

2019-08-06 Thread GitBox

huaxingao commented on issue #25371: [SPARK-28393][SQL][PYTHON][TESTS] Convert 
and port 'pgSQL/join.sql' into UDF test base
URL: https://github.com/apache/spark/pull/25371#issuecomment-518855906
 
 
   @HyukjinKwon @viirya 
   Sorry I somehow messed up the original PR, so I closed it and opened a new 
one. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on issue #25371: [SPARK-28393][SQL][PYTHON][TESTS] Convert and port 'pgSQL/join.sql' into UDF test base

2019-08-06 Thread GitBox

AmplabJenkins commented on issue #25371: [SPARK-28393][SQL][PYTHON][TESTS] 
Convert and port 'pgSQL/join.sql' into UDF test base
URL: https://github.com/apache/spark/pull/25371#issuecomment-518855909
 
 
   Merged build finished. Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on issue #25371: [SPARK-28393][SQL][PYTHON][TESTS] Convert and port 'pgSQL/join.sql' into UDF test base

2019-08-06 Thread GitBox

AmplabJenkins commented on issue #25371: [SPARK-28393][SQL][PYTHON][TESTS] 
Convert and port 'pgSQL/join.sql' into UDF test base
URL: https://github.com/apache/spark/pull/25371#issuecomment-518855919
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/13814/
   Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] SparkQA commented on issue #25371: [SPARK-28393][SQL][PYTHON][TESTS] Convert and port 'pgSQL/join.sql' into UDF test base

2019-08-06 Thread GitBox

SparkQA commented on issue #25371: [SPARK-28393][SQL][PYTHON][TESTS] Convert 
and port 'pgSQL/join.sql' into UDF test base
URL: https://github.com/apache/spark/pull/25371#issuecomment-518856479
 
 
   **[Test build #108732 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/108732/testReport)**
 for PR 25371 at commit 
[`b7aad47`](https://github.com/apache/spark/commit/b7aad47c0121501343729c7d73a75df74722267e).


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on issue #25371: [SPARK-28393][SQL][PYTHON][TESTS] Convert and port 'pgSQL/join.sql' into UDF test base

2019-08-06 Thread GitBox

AmplabJenkins removed a comment on issue #25371: 
[SPARK-28393][SQL][PYTHON][TESTS] Convert and port 'pgSQL/join.sql' into UDF 
test base
URL: https://github.com/apache/spark/pull/25371#issuecomment-518855909
 
 
   Merged build finished. Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on issue #25371: [SPARK-28393][SQL][PYTHON][TESTS] Convert and port 'pgSQL/join.sql' into UDF test base

2019-08-06 Thread GitBox

AmplabJenkins removed a comment on issue #25371: 
[SPARK-28393][SQL][PYTHON][TESTS] Convert and port 'pgSQL/join.sql' into UDF 
test base
URL: https://github.com/apache/spark/pull/25371#issuecomment-518855919
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/13814/
   Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] vanzin commented on a change in pull request #25304: [SPARK-28570][CORE][SHUFFLE] Make UnsafeShuffleWriter use the new API.

2019-08-06 Thread GitBox

vanzin commented on a change in pull request #25304: 
[SPARK-28570][CORE][SHUFFLE] Make UnsafeShuffleWriter use the new API.
URL: https://github.com/apache/spark/pull/25304#discussion_r311290914
 
 

 ##
 File path: 
core/src/main/java/org/apache/spark/shuffle/api/ShuffleExecutorComponents.java
 ##
 @@ -39,17 +40,39 @@
   /**
* Called once per map task to create a writer that will be responsible for 
persisting all the
* partitioned bytes written by that map task.
-   *  @param shuffleId Unique identifier for the shuffle the map task is a 
part of
+   *
+   * @param shuffleId Unique identifier for the shuffle the map task is a part 
of
* @param mapId Within the shuffle, the identifier of the map task
* @param mapTaskAttemptId Identifier of the task attempt. Multiple attempts 
of the same map task
- * with the same (shuffleId, mapId) pair can be 
distinguished by the
- * different values of mapTaskAttemptId.
+   * with the same (shuffleId, mapId) pair can be 
distinguished by the
+   * different values of mapTaskAttemptId.
* @param numPartitions The number of partitions that will be written by the 
map task. Some of
-*  these partitions may be empty.
+   *  these partitions may be empty.
*/
   ShuffleMapOutputWriter createMapOutputWriter(
   int shuffleId,
   int mapId,
   long mapTaskAttemptId,
   int numPartitions) throws IOException;
+
+  /**
+   * An optional extension for creating a map output writer that can optimize 
the transfer of a
+   * single partition file, as the entire result of a map task, to the backing 
store.
+   * 
+   * Most implementations should return the default {@link Optional#empty()} 
to indicate that
+   * they do not support this optimization. This primarily is for 
backwards-compatibility in
+   * preserving an optimization in the local disk shuffle storage 
implementation.
+   *
+   * @param shuffleId Unique identifier for the shuffle the map task is a part 
of
+   * @param mapId Within the shuffle, the identifier of the map task
+   * @param mapTaskAttemptId Identifier of the task attempt. Multiple attempts 
of the same map task
+   * with the same (shuffleId, mapId) pair can be 
distinguished by the
+   * different values of mapTaskAttemptId.
+   */
+  default Optional 
createSingleFileMapOutputWriter(
 
 Review comment:
   This feels like a lot of indirection to implement one method... what about 
returning a boolean if the transfer is supported, or throwing 
`UnsupportedOperationException` (although that's a bit slower)?
   
   (If the transfer is supported but fails you'd still throw an `IOException`.)


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] dongjoon-hyun commented on issue #25366: [SPARK-27924][SQL][TEST][FOLLOW-UP] Open comment about boolean test

2019-08-06 Thread GitBox

dongjoon-hyun commented on issue #25366: [SPARK-27924][SQL][TEST][FOLLOW-UP] 
Open comment about boolean test
URL: https://github.com/apache/spark/pull/25366#issuecomment-518858585
 
 
   I made a PR to you, @beliefer . Please review and merge.
   - https://github.com/beliefer/spark/pull/2


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] vanzin commented on a change in pull request #25342: [SPARK-28571][CORE][SHUFFLE] Use the shuffle writer plugin for the SortShuffleWriter

2019-08-06 Thread GitBox

vanzin commented on a change in pull request #25342: 
[SPARK-28571][CORE][SHUFFLE] Use the shuffle writer plugin for the 
SortShuffleWriter
URL: https://github.com/apache/spark/pull/25342#discussion_r311293073
 
 

 ##
 File path: 
core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala
 ##
 @@ -718,6 +717,85 @@ private[spark] class ExternalSorter[K, V, C](
 lengths
   }
 
+  /**
+   * Write all the data added into this ExternalSorter into a map output 
writer that pushes bytes
+   * to some arbitrary backing store. This is called by the SortShuffleWriter.
+   *
+   * @return array of lengths, in bytes, of each partition of the file (used 
by map output tracker)
+   */
+  def writePartitionedMapOutput(
+  shuffleId: Int, mapId: Int, mapOutputWriter: ShuffleMapOutputWriter): 
Array[Long] = {
 
 Review comment:
   nit: multi-line arg style


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] vanzin commented on a change in pull request #25342: [SPARK-28571][CORE][SHUFFLE] Use the shuffle writer plugin for the SortShuffleWriter

2019-08-06 Thread GitBox

vanzin commented on a change in pull request #25342: 
[SPARK-28571][CORE][SHUFFLE] Use the shuffle writer plugin for the 
SortShuffleWriter
URL: https://github.com/apache/spark/pull/25342#discussion_r311295878
 
 

 ##
 File path: 
core/src/main/scala/org/apache/spark/util/collection/ShufflePartitionPairsWriter.scala
 ##
 @@ -0,0 +1,98 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util.collection
+
+import java.io.{Closeable, FilterOutputStream, OutputStream}
+
+import org.apache.spark.serializer.{SerializationStream, SerializerInstance, 
SerializerManager}
+import org.apache.spark.shuffle.ShuffleWriteMetricsReporter
+import org.apache.spark.shuffle.api.ShufflePartitionWriter
+import org.apache.spark.storage.BlockId
+
+/**
+ * A key-value writer inspired by {@link DiskBlockObjectWriter} that pushes 
the bytes to an
+ * arbitrary partition writer instead of writing to local disk through the 
block manager.
+ */
+private[spark] class ShufflePartitionPairsWriter(
+partitionWriter: ShufflePartitionWriter,
+serializerManager: SerializerManager,
+serializerInstance: SerializerInstance,
+blockId: BlockId,
+writeMetrics: ShuffleWriteMetricsReporter)
+  extends PairsWriter with Closeable {
+
+  private var isOpen = false
+  private var partitionStream: OutputStream = _
+  private var wrappedStream: OutputStream = _
+  private var objOut: SerializationStream = _
+  private var numRecordsWritten = 0
+  private var curNumBytesWritten = 0L
+
+  override def write(key: Any, value: Any): Unit = {
+if (!isOpen) {
+  open()
+  isOpen = true
+}
+objOut.writeKey(key)
+objOut.writeValue(value)
+writeMetrics.incRecordsWritten(1)
+  }
+
+  private def open(): Unit = {
+partitionStream = partitionWriter.openStream
+wrappedStream = serializerManager.wrapStream(blockId, partitionStream)
+objOut = serializerInstance.serializeStream(wrappedStream)
+  }
+
+  override def close(): Unit = {
+if (isOpen) {
 
 Review comment:
   Minor, but if there's an error in `open()` (e.g. when initializing 
`wrappedStream`) this will leave the underlying `partitionStream` opened.
   
   Maybe this flag isn't needed and you can just check whether the fields are 
initialized?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] vanzin commented on issue #25236: [SPARK-28487][k8s] More responsive dynamic allocation with K8S.

2019-08-06 Thread GitBox

vanzin commented on issue #25236: [SPARK-28487][k8s] More responsive dynamic 
allocation with K8S.
URL: https://github.com/apache/spark/pull/25236#issuecomment-518863671
 
 
   Let's see if @mccheah has anything to add, otherwise I'll end up pushing 
before EOW.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] dongjoon-hyun commented on issue #25229: [SPARK-27900][K8s] Add jvm oom flag

2019-08-06 Thread GitBox

dongjoon-hyun commented on issue #25229: [SPARK-27900][K8s] Add jvm oom flag
URL: https://github.com/apache/spark/pull/25229#issuecomment-518863944
 
 
   Hmm. I got it.
   
   That rang me [my original 
question](https://github.com/apache/spark/pull/25229#pullrequestreview-267298632)
 again. If the users supply `spark.driver.extraJavaOptions`, why do we try to 
enforce this by Spark-side?
   >  Right now user supplied spark.driver.extraJavaOptions will be ignored 
because they are part of a properties files
   
   For me, the documentation update seems to be enough. How do you think about 
that?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] dongjoon-hyun edited a comment on issue #25229: [SPARK-27900][K8s] Add jvm oom flag

2019-08-06 Thread GitBox

dongjoon-hyun edited a comment on issue #25229: [SPARK-27900][K8s] Add jvm oom 
flag
URL: https://github.com/apache/spark/pull/25229#issuecomment-518863944
 
 
   Hmm. I got it.
   
   That rang me [my original 
question](https://github.com/apache/spark/pull/25229#pullrequestreview-267298632)
 again. If the users supply `spark.driver.extraJavaOptions`, why do we try to 
enforce this by Spark-side?
   >  Right now user supplied spark.driver.extraJavaOptions will be ignored 
because they are part of a properties files
   
   For me, the documentation update seems to be enough. How do you think about 
that? Did I miss something?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] shivsood commented on a change in pull request #25344: [WIP][SPARK-28151][SQL] Mapped ByteType to TinyINT for MsSQLServerDialect

2019-08-06 Thread GitBox

shivsood commented on a change in pull request #25344: [WIP][SPARK-28151][SQL] 
Mapped ByteType to TinyINT for MsSQLServerDialect
URL: https://github.com/apache/spark/pull/25344#discussion_r311300690
 
 

 ##
 File path: 
external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/MsSqlServerIntegrationSuite.scala
 ##
 @@ -202,4 +204,25 @@ class MsSqlServerIntegrationSuite extends 
DockerJDBCIntegrationSuite {
 df2.write.jdbc(jdbcUrl, "datescopy", new Properties)
 df3.write.jdbc(jdbcUrl, "stringscopy", new Properties)
   }
+
+  test("SPARK-28151 Test write table with BYTETYPE") {
+val tableSchema = StructType(Seq(StructField("serialNum", ByteType, true)))
+val tableData = Seq(Row(10))
+val df1 = spark.createDataFrame(
+  spark.sparkContext.parallelize(tableData),
+  tableSchema)
+
+df1.write
+  .format("jdbc")
+  .mode("overwrite")
+  .option("url", jdbcUrl)
+  .option("dbtable", "testTable")
+  .save()
+val df2 = spark.read
+  .format("jdbc")
+  .option("url", jdbcUrl)
+  .option("dbtable", "byteTable")
+  .load()
+df2.show()
 
 Review comment:
   Yes, would add a check of row counts.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] shivsood commented on a change in pull request #25344: [WIP][SPARK-28151][SQL] Mapped ByteType to TinyINT for MsSQLServerDialect

2019-08-06 Thread GitBox

shivsood commented on a change in pull request #25344: [WIP][SPARK-28151][SQL] 
Mapped ByteType to TinyINT for MsSQLServerDialect
URL: https://github.com/apache/spark/pull/25344#discussion_r311300562
 
 

 ##
 File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala
 ##
 @@ -550,7 +550,7 @@ object JdbcUtils extends Logging {
 
 case ByteType =>
   (stmt: PreparedStatement, row: Row, pos: Int) =>
-stmt.setInt(pos + 1, row.getByte(pos))
+stmt.setByte(pos + 1, row.getByte(pos))
 
 Review comment:
   Yes


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] vanzin commented on a change in pull request #25135: [SPARK-28367][SS] Use new KafkaConsumer.poll API in Kafka connector

2019-08-06 Thread GitBox

vanzin commented on a change in pull request #25135: [SPARK-28367][SS] Use new 
KafkaConsumer.poll API in Kafka connector
URL: https://github.com/apache/spark/pull/25135#discussion_r311300497
 
 

 ##
 File path: 
external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaOffsetReader.scala
 ##
 @@ -419,6 +416,21 @@ private[kafka010] class KafkaOffsetReader(
 stopConsumer()
 _consumer = null  // will automatically get reinitialized again
   }
+
+  private def getPartitions(): ju.Set[TopicPartition] = {
+consumer.poll(jt.Duration.ZERO)
+var partitions = consumer.assignment()
+val startTimeMs = System.currentTimeMillis()
 
 Review comment:
   For this kind of logic it's better to use `System.nanoTime()` which is 
monotonic. Also you can do a little less computation this way:
   
   ```
   val deadline = System.nanoTime() + someTimeout;
   while (... && System.nanoTime() < deadline) {
   
   }
   ```


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] icexelloss commented on a change in pull request #24981: [SPARK-27463][PYTHON] Support Dataframe Cogroup via Pandas UDFs

2019-08-06 Thread GitBox

icexelloss commented on a change in pull request #24981: [SPARK-27463][PYTHON] 
Support Dataframe Cogroup via Pandas UDFs
URL: https://github.com/apache/spark/pull/24981#discussion_r311301959
 
 

 ##
 File path: python/pyspark/sql/tests/test_pandas_udf_cogrouped_map.py
 ##
 @@ -0,0 +1,285 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import datetime
+import unittest
+import sys
+
+from collections import OrderedDict
+from decimal import Decimal
+
+from pyspark.sql import Row
+from pyspark.sql.functions import array, explode, col, lit, udf, sum, 
pandas_udf, PandasUDFType
+from pyspark.sql.types import *
+from pyspark.testing.sqlutils import ReusedSQLTestCase, have_pandas, 
have_pyarrow, \
+pandas_requirement_message, pyarrow_requirement_message
+from pyspark.testing.utils import QuietTest
+
+if have_pandas:
+import pandas as pd
+from pandas.util.testing import assert_frame_equal, assert_series_equal
+
+if have_pyarrow:
+import pyarrow as pa
+
+
+"""
+Tests below use pd.DataFrame.assign that will infer mixed types (unicode/str) 
for column names
+from kwargs w/ Python 2, so need to set check_column_type=False and avoid this 
check
+"""
+if sys.version < '3':
+_check_column_type = False
+else:
+_check_column_type = True
+
+
+@unittest.skipIf(
+not have_pandas or not have_pyarrow,
+pandas_requirement_message or pyarrow_requirement_message)
+class CoGroupedMapPandasUDFTests(ReusedSQLTestCase):
+
+@property
+def data1(self):
+return self.spark.range(10).toDF('id') \
+.withColumn("ks", array([lit(i) for i in range(20, 30)])) \
+.withColumn("k", explode(col('ks')))\
+.withColumn("v", col('k') * 10)\
+.drop('ks')
+
+@property
+def data2(self):
+return self.spark.range(10).toDF('id') \
+.withColumn("ks", array([lit(i) for i in range(20, 30)])) \
+.withColumn("k", explode(col('ks'))) \
+.withColumn("v2", col('k') * 100) \
+.drop('ks')
+
+def test_simple(self):
+self._test_merge(self.data1, self.data2)
+
+def test_left_group_empty(self):
+left = self.data1.where(col("id") % 2 == 0)
+self._test_merge(left, self.data2)
+
+def test_right_group_empty(self):
+right = self.data2.where(col("id") % 2 == 0)
+self._test_merge(self.data1, right)
+
+def test_different_schemas(self):
+right = self.data2.withColumn('v3', lit('a'))
+self._test_merge(self.data1, right, 'id long, k int, v int, v2 int, v3 
string')
+
+def test_complex_group_by(self):
+left = pd.DataFrame.from_dict({
+'id': [1, 2, 3],
+'k':  [5, 6, 7],
+'v': [9, 10, 11]
+})
+
+right = pd.DataFrame.from_dict({
+'id': [11, 12, 13],
+'k': [5, 6, 7],
+'v2': [90, 100, 110]
+})
+
+left_df = self.spark\
 
 Review comment:
   Maybe rename to `left_cdf` (left cogrouped dataframe)?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] icexelloss commented on a change in pull request #24981: [SPARK-27463][PYTHON] Support Dataframe Cogroup via Pandas UDFs

2019-08-06 Thread GitBox

icexelloss commented on a change in pull request #24981: [SPARK-27463][PYTHON] 
Support Dataframe Cogroup via Pandas UDFs
URL: https://github.com/apache/spark/pull/24981#discussion_r311301959
 
 

 ##
 File path: python/pyspark/sql/tests/test_pandas_udf_cogrouped_map.py
 ##
 @@ -0,0 +1,285 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import datetime
+import unittest
+import sys
+
+from collections import OrderedDict
+from decimal import Decimal
+
+from pyspark.sql import Row
+from pyspark.sql.functions import array, explode, col, lit, udf, sum, 
pandas_udf, PandasUDFType
+from pyspark.sql.types import *
+from pyspark.testing.sqlutils import ReusedSQLTestCase, have_pandas, 
have_pyarrow, \
+pandas_requirement_message, pyarrow_requirement_message
+from pyspark.testing.utils import QuietTest
+
+if have_pandas:
+import pandas as pd
+from pandas.util.testing import assert_frame_equal, assert_series_equal
+
+if have_pyarrow:
+import pyarrow as pa
+
+
+"""
+Tests below use pd.DataFrame.assign that will infer mixed types (unicode/str) 
for column names
+from kwargs w/ Python 2, so need to set check_column_type=False and avoid this 
check
+"""
+if sys.version < '3':
+_check_column_type = False
+else:
+_check_column_type = True
+
+
+@unittest.skipIf(
+not have_pandas or not have_pyarrow,
+pandas_requirement_message or pyarrow_requirement_message)
+class CoGroupedMapPandasUDFTests(ReusedSQLTestCase):
+
+@property
+def data1(self):
+return self.spark.range(10).toDF('id') \
+.withColumn("ks", array([lit(i) for i in range(20, 30)])) \
+.withColumn("k", explode(col('ks')))\
+.withColumn("v", col('k') * 10)\
+.drop('ks')
+
+@property
+def data2(self):
+return self.spark.range(10).toDF('id') \
+.withColumn("ks", array([lit(i) for i in range(20, 30)])) \
+.withColumn("k", explode(col('ks'))) \
+.withColumn("v2", col('k') * 100) \
+.drop('ks')
+
+def test_simple(self):
+self._test_merge(self.data1, self.data2)
+
+def test_left_group_empty(self):
+left = self.data1.where(col("id") % 2 == 0)
+self._test_merge(left, self.data2)
+
+def test_right_group_empty(self):
+right = self.data2.where(col("id") % 2 == 0)
+self._test_merge(self.data1, right)
+
+def test_different_schemas(self):
+right = self.data2.withColumn('v3', lit('a'))
+self._test_merge(self.data1, right, 'id long, k int, v int, v2 int, v3 
string')
+
+def test_complex_group_by(self):
+left = pd.DataFrame.from_dict({
+'id': [1, 2, 3],
+'k':  [5, 6, 7],
+'v': [9, 10, 11]
+})
+
+right = pd.DataFrame.from_dict({
+'id': [11, 12, 13],
+'k': [5, 6, 7],
+'v2': [90, 100, 110]
+})
+
+left_df = self.spark\
 
 Review comment:
   Maybe rename to `left_gdf` (left grouped dataframe)?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] yeshengm commented on a change in pull request #24983: [SPARK-27714][SQL][CBO] Support Genetic Algorithm based join reorder

2019-08-06 Thread GitBox

yeshengm commented on a change in pull request #24983: [SPARK-27714][SQL][CBO] 
Support Genetic Algorithm based join reorder
URL: https://github.com/apache/spark/pull/24983#discussion_r311303183
 
 

 ##
 File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala
 ##
 @@ -470,3 +397,427 @@ object JoinReorderDPFilters extends PredicateHelper {
  * extended with the set of connected/unconnected plans.
  */
 case class JoinGraphInfo (starJoins: Set[Int], nonStarJoins: Set[Int])
+
+/**
+ * Reorder the joins using a genetic algorithm. The algorithm treat the 
reorder problem
+ * to a traveling salesmen problem, and use genetic algorithm give an 
optimized solution.
+ *
+ * The implementation refs the geqo in postgresql, which is contibuted by 
Darrell Whitley:
+ * https://www.postgresql.org/docs/9.1/geqo-pg-intro.html
+ *
+ * For more info about genetic algorithm and the edge recombination crossover, 
pls see:
+ * "A Genetic Algorithm Tutorial, Darrell Whitley"
+ * https://link.springer.com/article/10.1007/BF00175354
+ * and "Scheduling Problems and Traveling Salesmen: The Genetic Edge 
Recombination Operator,
+ * Darrell Whitley et al." https://dl.acm.org/citation.cfm?id=657238
+ * respectively.
+ */
+object JoinReorderGA extends PredicateHelper with Logging {
+
+  def search(
+  conf: SQLConf,
+  items: Seq[LogicalPlan],
+  conditions: Set[Expression],
+  output: Seq[Attribute]): Option[LogicalPlan] = {
+
+val startTime = System.nanoTime()
+
+val itemsWithIndex = items.zipWithIndex.map {
+  case (plan, id) => id -> JoinPlan(Set(id), plan, Set.empty, Cost(0, 0))
+}.toMap
+
+val topOutputSet = AttributeSet(output)
+
+val pop = Population(conf, itemsWithIndex, conditions, topOutputSet).evolve
+
+val durationInMs = (System.nanoTime() - startTime) / (1000 * 1000)
+logInfo(s"Join reordering finished. Duration: $durationInMs ms, number of 
items: " +
+s"${items.length}, number of plans in memo: ${ pop.chromos.size}")
+
+assert(pop.chromos.head.basicPlans.size == items.length)
+pop.chromos.head.integratedPlan match {
+  case Some(joinPlan) => joinPlan.plan match {
+case p @ Project(projectList, _: Join) if projectList != output =>
+  assert(topOutputSet == p.outputSet)
+  // Keep the same order of final output attributes.
+  Some(p.copy(projectList = output))
+case finalPlan if !sameOutput(finalPlan, output) =>
+  Some(Project(output, finalPlan))
+case finalPlan =>
+  Some(finalPlan)
+  }
+  case _ => None
+}
+  }
+}
+
+/**
+ * A pair of parent individuals can breed a child with certain crossover 
process.
+ * With crossover, child can inherit gene from its parents, and these gene 
snippets
+ * finally compose a new [[Chromosome]].
+ */
+@DeveloperApi
+trait Crossover {
+
+  /**
+   * Generate a new [[Chromosome]] from the given parent [[Chromosome]]s,
+   * with this crossover algorithm.
+   */
+  def newChromo(father: Chromosome, mother: Chromosome) : Chromosome
+}
+
+case class EdgeTable(table: Map[JoinPlan, Seq[JoinPlan]])
 
 Review comment:
   Make it a inner class of `EdgeRecombination`? I don't think logically it's 
in the same level as other classes


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] yeshengm commented on a change in pull request #24983: [SPARK-27714][SQL][CBO] Support Genetic Algorithm based join reorder

2019-08-06 Thread GitBox

yeshengm commented on a change in pull request #24983: [SPARK-27714][SQL][CBO] 
Support Genetic Algorithm based join reorder
URL: https://github.com/apache/spark/pull/24983#discussion_r311303978
 
 

 ##
 File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala
 ##
 @@ -470,3 +397,427 @@ object JoinReorderDPFilters extends PredicateHelper {
  * extended with the set of connected/unconnected plans.
  */
 case class JoinGraphInfo (starJoins: Set[Int], nonStarJoins: Set[Int])
+
+/**
+ * Reorder the joins using a genetic algorithm. The algorithm treat the 
reorder problem
+ * to a traveling salesmen problem, and use genetic algorithm give an 
optimized solution.
+ *
+ * The implementation refs the geqo in postgresql, which is contibuted by 
Darrell Whitley:
+ * https://www.postgresql.org/docs/9.1/geqo-pg-intro.html
+ *
+ * For more info about genetic algorithm and the edge recombination crossover, 
pls see:
+ * "A Genetic Algorithm Tutorial, Darrell Whitley"
+ * https://link.springer.com/article/10.1007/BF00175354
+ * and "Scheduling Problems and Traveling Salesmen: The Genetic Edge 
Recombination Operator,
+ * Darrell Whitley et al." https://dl.acm.org/citation.cfm?id=657238
+ * respectively.
+ */
+object JoinReorderGA extends PredicateHelper with Logging {
+
+  def search(
+  conf: SQLConf,
+  items: Seq[LogicalPlan],
+  conditions: Set[Expression],
+  output: Seq[Attribute]): Option[LogicalPlan] = {
+
+val startTime = System.nanoTime()
+
+val itemsWithIndex = items.zipWithIndex.map {
+  case (plan, id) => id -> JoinPlan(Set(id), plan, Set.empty, Cost(0, 0))
+}.toMap
+
+val topOutputSet = AttributeSet(output)
+
+val pop = Population(conf, itemsWithIndex, conditions, topOutputSet).evolve
+
+val durationInMs = (System.nanoTime() - startTime) / (1000 * 1000)
+logInfo(s"Join reordering finished. Duration: $durationInMs ms, number of 
items: " +
+s"${items.length}, number of plans in memo: ${ pop.chromos.size}")
+
+assert(pop.chromos.head.basicPlans.size == items.length)
+pop.chromos.head.integratedPlan match {
+  case Some(joinPlan) => joinPlan.plan match {
+case p @ Project(projectList, _: Join) if projectList != output =>
+  assert(topOutputSet == p.outputSet)
+  // Keep the same order of final output attributes.
+  Some(p.copy(projectList = output))
+case finalPlan if !sameOutput(finalPlan, output) =>
+  Some(Project(output, finalPlan))
+case finalPlan =>
+  Some(finalPlan)
+  }
+  case _ => None
+}
+  }
+}
+
+/**
+ * A pair of parent individuals can breed a child with certain crossover 
process.
+ * With crossover, child can inherit gene from its parents, and these gene 
snippets
+ * finally compose a new [[Chromosome]].
+ */
+@DeveloperApi
+trait Crossover {
+
+  /**
+   * Generate a new [[Chromosome]] from the given parent [[Chromosome]]s,
+   * with this crossover algorithm.
+   */
+  def newChromo(father: Chromosome, mother: Chromosome) : Chromosome
+}
+
+case class EdgeTable(table: Map[JoinPlan, Seq[JoinPlan]])
+
+/**
+ * This class implements the Genetic Edge Recombination algorithm.
+ * For more information about the Genetic Edge Recombination,
+ * see "Scheduling Problems and Traveling Salesmen: The Genetic Edge
+ * Recombination Operator" by Darrell Whitley et al.
+ * https://dl.acm.org/citation.cfm?id=657238
+ */
+object EdgeRecombination extends Crossover {
+
+  def genEdgeTable(father: Chromosome, mother: Chromosome) : EdgeTable = {
+val fatherTable = father.basicPlans.map(g => g -> 
findNeighbours(father.basicPlans, g)).toMap
+val motherTable = mother.basicPlans.map(g => g -> 
findNeighbours(mother.basicPlans, g)).toMap
+EdgeTable(
+  fatherTable.map(entry => entry._1 -> (entry._2 ++ 
motherTable(entry._1
+  }
+
+  def findNeighbours(genes: Seq[JoinPlan], g: JoinPlan) : Seq[JoinPlan] = {
+val genesIndexed = genes.toIndexedSeq
+val index = genesIndexed.indexOf(g)
+val length = genes.size
+if (index > 0 && index < length - 1) {
+  Seq(genesIndexed(index - 1), genesIndexed(index + 1))
+} else if (index == 0) {
+  Seq(genesIndexed(1), genesIndexed(length - 1))
+} else if (index == length - 1) {
+  Seq(genesIndexed(0), genesIndexed(length - 2))
+} else {
+  Seq()
+}
+  }
+
+  override def newChromo(father: Chromosome, mother: Chromosome): Chromosome = 
{
+var newGenes: Seq[JoinPlan] = Seq()
+// 1. Generate the edge table.
+var table = genEdgeTable(father, mother).table
+// 2. Choose a start point randomly from the heads of father/mother.
+var current =
+  if (util.Random.nextInt(2) == 0) father.basicPlans.head else 
mother.basicPlans.head
+newGenes :+= current
+
+var stop = false
+while (!stop) {
+  // 3. Filter out the chosen point from the edge table.
+  table = table.map(
+

[GitHub] [spark] yeshengm commented on a change in pull request #24983: [SPARK-27714][SQL][CBO] Support Genetic Algorithm based join reorder

2019-08-06 Thread GitBox

yeshengm commented on a change in pull request #24983: [SPARK-27714][SQL][CBO] 
Support Genetic Algorithm based join reorder
URL: https://github.com/apache/spark/pull/24983#discussion_r311304704
 
 

 ##
 File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala
 ##
 @@ -470,3 +397,427 @@ object JoinReorderDPFilters extends PredicateHelper {
  * extended with the set of connected/unconnected plans.
  */
 case class JoinGraphInfo (starJoins: Set[Int], nonStarJoins: Set[Int])
+
+/**
+ * Reorder the joins using a genetic algorithm. The algorithm treat the 
reorder problem
+ * to a traveling salesmen problem, and use genetic algorithm give an 
optimized solution.
+ *
+ * The implementation refs the geqo in postgresql, which is contibuted by 
Darrell Whitley:
+ * https://www.postgresql.org/docs/9.1/geqo-pg-intro.html
+ *
+ * For more info about genetic algorithm and the edge recombination crossover, 
pls see:
+ * "A Genetic Algorithm Tutorial, Darrell Whitley"
+ * https://link.springer.com/article/10.1007/BF00175354
+ * and "Scheduling Problems and Traveling Salesmen: The Genetic Edge 
Recombination Operator,
+ * Darrell Whitley et al." https://dl.acm.org/citation.cfm?id=657238
+ * respectively.
+ */
+object JoinReorderGA extends PredicateHelper with Logging {
+
+  def search(
+  conf: SQLConf,
+  items: Seq[LogicalPlan],
+  conditions: Set[Expression],
+  output: Seq[Attribute]): Option[LogicalPlan] = {
+
+val startTime = System.nanoTime()
+
+val itemsWithIndex = items.zipWithIndex.map {
+  case (plan, id) => id -> JoinPlan(Set(id), plan, Set.empty, Cost(0, 0))
+}.toMap
+
+val topOutputSet = AttributeSet(output)
+
+val pop = Population(conf, itemsWithIndex, conditions, topOutputSet).evolve
+
+val durationInMs = (System.nanoTime() - startTime) / (1000 * 1000)
+logInfo(s"Join reordering finished. Duration: $durationInMs ms, number of 
items: " +
+s"${items.length}, number of plans in memo: ${ pop.chromos.size}")
+
+assert(pop.chromos.head.basicPlans.size == items.length)
+pop.chromos.head.integratedPlan match {
+  case Some(joinPlan) => joinPlan.plan match {
+case p @ Project(projectList, _: Join) if projectList != output =>
+  assert(topOutputSet == p.outputSet)
+  // Keep the same order of final output attributes.
+  Some(p.copy(projectList = output))
+case finalPlan if !sameOutput(finalPlan, output) =>
+  Some(Project(output, finalPlan))
+case finalPlan =>
+  Some(finalPlan)
+  }
+  case _ => None
+}
+  }
+}
+
+/**
+ * A pair of parent individuals can breed a child with certain crossover 
process.
+ * With crossover, child can inherit gene from its parents, and these gene 
snippets
+ * finally compose a new [[Chromosome]].
+ */
+@DeveloperApi
+trait Crossover {
+
+  /**
+   * Generate a new [[Chromosome]] from the given parent [[Chromosome]]s,
+   * with this crossover algorithm.
+   */
+  def newChromo(father: Chromosome, mother: Chromosome) : Chromosome
+}
+
+case class EdgeTable(table: Map[JoinPlan, Seq[JoinPlan]])
 
 Review comment:
   BTW, there's no need to introduce an extra case class. We can simply do 
`type EdgeTable = Map[JoinPlan, Seq[JoinPlan]]`.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] SparkQA commented on issue #25040: [SPARK-28238][SQL] Implement DESCRIBE TABLE for Data Source V2 Tables.

2019-08-06 Thread GitBox

SparkQA commented on issue #25040: [SPARK-28238][SQL] Implement DESCRIBE TABLE 
for Data Source V2 Tables.
URL: https://github.com/apache/spark/pull/25040#issuecomment-518869859
 
 
   **[Test build #108729 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/108729/testReport)**
 for PR 25040 at commit 
[`cff78a1`](https://github.com/apache/spark/commit/cff78a16e691917e812b4cd63bf7544a54af4742).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on issue #25040: [SPARK-28238][SQL] Implement DESCRIBE TABLE for Data Source V2 Tables.

2019-08-06 Thread GitBox

SparkQA removed a comment on issue #25040: [SPARK-28238][SQL] Implement 
DESCRIBE TABLE for Data Source V2 Tables.
URL: https://github.com/apache/spark/pull/25040#issuecomment-518802375
 
 
   **[Test build #108729 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/108729/testReport)**
 for PR 25040 at commit 
[`cff78a1`](https://github.com/apache/spark/commit/cff78a16e691917e812b4cd63bf7544a54af4742).


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] zsxwing commented on a change in pull request #25135: [SPARK-28367][SS] Use new KafkaConsumer.poll API in Kafka connector

2019-08-06 Thread GitBox

zsxwing commented on a change in pull request #25135: [SPARK-28367][SS] Use new 
KafkaConsumer.poll API in Kafka connector
URL: https://github.com/apache/spark/pull/25135#discussion_r311305080
 
 

 ##
 File path: 
external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaOffsetReader.scala
 ##
 @@ -419,6 +416,21 @@ private[kafka010] class KafkaOffsetReader(
 stopConsumer()
 _consumer = null  // will automatically get reinitialized again
   }
+
+  private def getPartitions(): ju.Set[TopicPartition] = {
+consumer.poll(jt.Duration.ZERO)
+var partitions = consumer.assignment()
+val startTimeMs = System.currentTimeMillis()
+while (partitions.isEmpty && System.currentTimeMillis() - startTimeMs < 
pollTimeoutMs) {
+  // Poll to get the latest assigned partitions
+  consumer.poll(jt.Duration.ofMillis(100))
 
 Review comment:
   So using this new API will pull data to driver. Right? The previous 
`poll(0)` is basically a hack to avoid polling data in driver. Maybe we should 
ask the Kafka community to add a new API to pull metadata only.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] zsxwing commented on a change in pull request #25135: [SPARK-28367][SS] Use new KafkaConsumer.poll API in Kafka connector

2019-08-06 Thread GitBox

zsxwing commented on a change in pull request #25135: [SPARK-28367][SS] Use new 
KafkaConsumer.poll API in Kafka connector
URL: https://github.com/apache/spark/pull/25135#discussion_r311305080
 
 

 ##
 File path: 
external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaOffsetReader.scala
 ##
 @@ -419,6 +416,21 @@ private[kafka010] class KafkaOffsetReader(
 stopConsumer()
 _consumer = null  // will automatically get reinitialized again
   }
+
+  private def getPartitions(): ju.Set[TopicPartition] = {
+consumer.poll(jt.Duration.ZERO)
+var partitions = consumer.assignment()
+val startTimeMs = System.currentTimeMillis()
+while (partitions.isEmpty && System.currentTimeMillis() - startTimeMs < 
pollTimeoutMs) {
+  // Poll to get the latest assigned partitions
+  consumer.poll(jt.Duration.ofMillis(100))
 
 Review comment:
   So using this new API will pull data to driver. Right? The previous 
`poll(0)` is basically a hack to avoid fetching data to driver. Maybe we should 
ask the Kafka community to add a new API to pull metadata only.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on issue #25040: [SPARK-28238][SQL] Implement DESCRIBE TABLE for Data Source V2 Tables.

2019-08-06 Thread GitBox

AmplabJenkins commented on issue #25040: [SPARK-28238][SQL] Implement DESCRIBE 
TABLE for Data Source V2 Tables.
URL: https://github.com/apache/spark/pull/25040#issuecomment-518870398
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/108729/
   Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on issue #25040: [SPARK-28238][SQL] Implement DESCRIBE TABLE for Data Source V2 Tables.

2019-08-06 Thread GitBox

AmplabJenkins commented on issue #25040: [SPARK-28238][SQL] Implement DESCRIBE 
TABLE for Data Source V2 Tables.
URL: https://github.com/apache/spark/pull/25040#issuecomment-518870396
 
 
   Merged build finished. Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on issue #25040: [SPARK-28238][SQL] Implement DESCRIBE TABLE for Data Source V2 Tables.

2019-08-06 Thread GitBox

AmplabJenkins removed a comment on issue #25040: [SPARK-28238][SQL] Implement 
DESCRIBE TABLE for Data Source V2 Tables.
URL: https://github.com/apache/spark/pull/25040#issuecomment-518870396
 
 
   Merged build finished. Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on issue #25040: [SPARK-28238][SQL] Implement DESCRIBE TABLE for Data Source V2 Tables.

2019-08-06 Thread GitBox

AmplabJenkins removed a comment on issue #25040: [SPARK-28238][SQL] Implement 
DESCRIBE TABLE for Data Source V2 Tables.
URL: https://github.com/apache/spark/pull/25040#issuecomment-518870398
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/108729/
   Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] skonto commented on issue #25229: [SPARK-27900][K8s] Add jvm oom flag

2019-08-06 Thread GitBox

skonto commented on issue #25229: [SPARK-27900][K8s] Add jvm oom flag
URL: https://github.com/apache/spark/pull/25229#issuecomment-518871183
 
 
   @Dooyoung-Hwang that was one of my options. So this PR was intended to 
protect the user from the spark issues when things dont go well and the driver 
does not exit on K8s. Anyway some people asked this as well on the related 
discussion: https://github.com/apache/spark/pull/24796. Anyway works for me 
what is the best UX?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] skonto edited a comment on issue #25229: [SPARK-27900][K8s] Add jvm oom flag

2019-08-06 Thread GitBox

skonto edited a comment on issue #25229: [SPARK-27900][K8s] Add jvm oom flag
URL: https://github.com/apache/spark/pull/25229#issuecomment-518871183
 
 
   @Dooyoung-Hwang that was one of my options. So this PR was intended to 
protect the user from the spark issues when things dont go well and the driver 
does not exit on K8s. Anyway some people asked this as well on the related 
discussion: https://github.com/apache/spark/pull/24796. I dont mind adding a 
note to the docs, what is the best UX?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] yeshengm commented on a change in pull request #24983: [SPARK-27714][SQL][CBO] Support Genetic Algorithm based join reorder

2019-08-06 Thread GitBox

yeshengm commented on a change in pull request #24983: [SPARK-27714][SQL][CBO] 
Support Genetic Algorithm based join reorder
URL: https://github.com/apache/spark/pull/24983#discussion_r311307700
 
 

 ##
 File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala
 ##
 @@ -470,3 +397,427 @@ object JoinReorderDPFilters extends PredicateHelper {
  * extended with the set of connected/unconnected plans.
  */
 case class JoinGraphInfo (starJoins: Set[Int], nonStarJoins: Set[Int])
+
+/**
+ * Reorder the joins using a genetic algorithm. The algorithm treat the 
reorder problem
+ * to a traveling salesmen problem, and use genetic algorithm give an 
optimized solution.
+ *
+ * The implementation refs the geqo in postgresql, which is contibuted by 
Darrell Whitley:
+ * https://www.postgresql.org/docs/9.1/geqo-pg-intro.html
+ *
+ * For more info about genetic algorithm and the edge recombination crossover, 
pls see:
+ * "A Genetic Algorithm Tutorial, Darrell Whitley"
+ * https://link.springer.com/article/10.1007/BF00175354
+ * and "Scheduling Problems and Traveling Salesmen: The Genetic Edge 
Recombination Operator,
+ * Darrell Whitley et al." https://dl.acm.org/citation.cfm?id=657238
+ * respectively.
+ */
+object JoinReorderGA extends PredicateHelper with Logging {
+
+  def search(
+  conf: SQLConf,
+  items: Seq[LogicalPlan],
+  conditions: Set[Expression],
+  output: Seq[Attribute]): Option[LogicalPlan] = {
+
+val startTime = System.nanoTime()
+
+val itemsWithIndex = items.zipWithIndex.map {
+  case (plan, id) => id -> JoinPlan(Set(id), plan, Set.empty, Cost(0, 0))
+}.toMap
+
+val topOutputSet = AttributeSet(output)
+
+val pop = Population(conf, itemsWithIndex, conditions, topOutputSet).evolve
+
+val durationInMs = (System.nanoTime() - startTime) / (1000 * 1000)
+logInfo(s"Join reordering finished. Duration: $durationInMs ms, number of 
items: " +
+s"${items.length}, number of plans in memo: ${ pop.chromos.size}")
+
+assert(pop.chromos.head.basicPlans.size == items.length)
+pop.chromos.head.integratedPlan match {
+  case Some(joinPlan) => joinPlan.plan match {
+case p @ Project(projectList, _: Join) if projectList != output =>
+  assert(topOutputSet == p.outputSet)
+  // Keep the same order of final output attributes.
+  Some(p.copy(projectList = output))
+case finalPlan if !sameOutput(finalPlan, output) =>
+  Some(Project(output, finalPlan))
+case finalPlan =>
+  Some(finalPlan)
+  }
+  case _ => None
+}
+  }
+}
+
+/**
+ * A pair of parent individuals can breed a child with certain crossover 
process.
+ * With crossover, child can inherit gene from its parents, and these gene 
snippets
+ * finally compose a new [[Chromosome]].
+ */
+@DeveloperApi
+trait Crossover {
+
+  /**
+   * Generate a new [[Chromosome]] from the given parent [[Chromosome]]s,
+   * with this crossover algorithm.
+   */
+  def newChromo(father: Chromosome, mother: Chromosome) : Chromosome
+}
+
+case class EdgeTable(table: Map[JoinPlan, Seq[JoinPlan]])
+
+/**
+ * This class implements the Genetic Edge Recombination algorithm.
+ * For more information about the Genetic Edge Recombination,
+ * see "Scheduling Problems and Traveling Salesmen: The Genetic Edge
+ * Recombination Operator" by Darrell Whitley et al.
+ * https://dl.acm.org/citation.cfm?id=657238
+ */
+object EdgeRecombination extends Crossover {
+
+  def genEdgeTable(father: Chromosome, mother: Chromosome) : EdgeTable = {
+val fatherTable = father.basicPlans.map(g => g -> 
findNeighbours(father.basicPlans, g)).toMap
+val motherTable = mother.basicPlans.map(g => g -> 
findNeighbours(mother.basicPlans, g)).toMap
+EdgeTable(
+  fatherTable.map(entry => entry._1 -> (entry._2 ++ 
motherTable(entry._1
+  }
+
+  def findNeighbours(genes: Seq[JoinPlan], g: JoinPlan) : Seq[JoinPlan] = {
+val genesIndexed = genes.toIndexedSeq
+val index = genesIndexed.indexOf(g)
+val length = genes.size
+if (index > 0 && index < length - 1) {
+  Seq(genesIndexed(index - 1), genesIndexed(index + 1))
+} else if (index == 0) {
+  Seq(genesIndexed(1), genesIndexed(length - 1))
+} else if (index == length - 1) {
+  Seq(genesIndexed(0), genesIndexed(length - 2))
+} else {
+  Seq()
+}
+  }
+
+  override def newChromo(father: Chromosome, mother: Chromosome): Chromosome = 
{
+var newGenes: Seq[JoinPlan] = Seq()
+// 1. Generate the edge table.
+var table = genEdgeTable(father, mother).table
+// 2. Choose a start point randomly from the heads of father/mother.
+var current =
+  if (util.Random.nextInt(2) == 0) father.basicPlans.head else 
mother.basicPlans.head
+newGenes :+= current
+
+var stop = false
+while (!stop) {
+  // 3. Filter out the chosen point from the edge table.
+  table = table.map(
+

[GitHub] [spark] viirya commented on a change in pull request #25360: [SPARK-28280][PYTHON][SQL][TESTS][FOLLOW-UP] Add UDF cases into group by clause in 'udf-group-by.sql'

2019-08-06 Thread GitBox

viirya commented on a change in pull request #25360: 
[SPARK-28280][PYTHON][SQL][TESTS][FOLLOW-UP] Add UDF cases into group by clause 
in 'udf-group-by.sql'
URL: https://github.com/apache/spark/pull/25360#discussion_r311308029
 
 

 ##
 File path: sql/core/src/test/resources/sql-tests/inputs/udf/udf-group-by.sql
 ##
 @@ -20,29 +20,25 @@ SELECT 'foo', COUNT(udf(a)) FROM testData GROUP BY 1;
 SELECT 'foo' FROM testData WHERE a = 0 GROUP BY udf(1);
 
 -- Aggregate grouped by literals (hash aggregate).
-SELECT 'foo', udf(APPROX_COUNT_DISTINCT(udf(a))) FROM testData WHERE a = 0 
GROUP BY 1;
+SELECT 'foo', udf(APPROX_COUNT_DISTINCT(udf(a))) FROM testData WHERE a = 0 
GROUP BY udf(1);
 
 -- Aggregate grouped by literals (sort aggregate).
-SELECT 'foo', MAX(STRUCT(udf(a))) FROM testData WHERE a = 0 GROUP BY 1;
+SELECT 'foo', MAX(STRUCT(udf(a))) FROM testData WHERE a = 0 GROUP BY udf(1);
 
 -- Aggregate with complex GroupBy expressions.
 SELECT udf(a + b), udf(COUNT(b)) FROM testData GROUP BY a + b;
 SELECT udf(a + 2), udf(COUNT(b)) FROM testData GROUP BY a + 1;
-
--- [SPARK-28445] Inconsistency between Scala and Python/Panda udfs when 
groupby with udf() is used
--- The following query will make Scala UDF work, but Python and Pandas udfs 
will fail with an AnalysisException.
--- The query should be added after SPARK-28445.
--- SELECT udf(a + 1), udf(COUNT(b)) FROM testData GROUP BY udf(a + 1);
+SELECT udf(a + 1), udf(COUNT(b)) FROM testData GROUP BY udf(a + 1);
 
 Review comment:
   From the diff, I think original query is `SELECT a + 1 + 1, COUNT(b) FROM 
testData GROUP BY a + 1`, why not `SELECT udf(a + 1) + 1, udf(COUNT(b)) FROM 
testData GROUP BY udf(a + 1)` here? So the result should not be a diff?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] jzhuge opened a new pull request #25372: [SPARK-28640][SQL] Only give warning when session catalog is not defined

2019-08-06 Thread GitBox

jzhuge opened a new pull request #25372: [SPARK-28640][SQL] Only give warning 
when session catalog is not defined
URL: https://github.com/apache/spark/pull/25372
 
 
   ## What changes were proposed in this pull request?
   
   LookupCatalog.sessionCatalog logs an error message and the exception stack 
upon any nonfatal exception.
   When session catalog is not defined, this may alarm the user unnecessarily.
   It should be enough to give a warning and return None.
   
   ## How was this patch tested?
   
   Manually run `spark.sessionState.analyzer.sessionCatalog` in Spark shell,
   expect a warning, not an error and a stack trace.
   ```
   scala> spark.sessionState.analyzer.sessionCatalog
   2019-08-06 15:43:23,068 WARN  [main] hive.HiveSessionStateBuilder$$anon$1 
(Logging.scala:logWarning(69)) - Session catalog is not defined
   res1: Option[org.apache.spark.sql.catalog.v2.CatalogPlugin] = None
   ```


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on issue #25372: [SPARK-28640][SQL] Only give warning when session catalog is not defined

2019-08-06 Thread GitBox

AmplabJenkins commented on issue #25372: [SPARK-28640][SQL] Only give warning 
when session catalog is not defined
URL: https://github.com/apache/spark/pull/25372#issuecomment-518873586
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/13815/
   Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on issue #25372: [SPARK-28640][SQL] Only give warning when session catalog is not defined

2019-08-06 Thread GitBox

AmplabJenkins commented on issue #25372: [SPARK-28640][SQL] Only give warning 
when session catalog is not defined
URL: https://github.com/apache/spark/pull/25372#issuecomment-518873584
 
 
   Merged build finished. Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] SparkQA commented on issue #25372: [SPARK-28640][SQL] Only give warning when session catalog is not defined

2019-08-06 Thread GitBox

SparkQA commented on issue #25372: [SPARK-28640][SQL] Only give warning when 
session catalog is not defined
URL: https://github.com/apache/spark/pull/25372#issuecomment-518874024
 
 
   **[Test build #108733 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/108733/testReport)**
 for PR 25372 at commit 
[`8dd1feb`](https://github.com/apache/spark/commit/8dd1feba8231dc0b73e08935997f4acd8eb957d6).


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on issue #25372: [SPARK-28640][SQL] Only give warning when session catalog is not defined

2019-08-06 Thread GitBox

AmplabJenkins removed a comment on issue #25372: [SPARK-28640][SQL] Only give 
warning when session catalog is not defined
URL: https://github.com/apache/spark/pull/25372#issuecomment-518873586
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/13815/
   Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on issue #25372: [SPARK-28640][SQL] Only give warning when session catalog is not defined

2019-08-06 Thread GitBox

AmplabJenkins removed a comment on issue #25372: [SPARK-28640][SQL] Only give 
warning when session catalog is not defined
URL: https://github.com/apache/spark/pull/25372#issuecomment-518873584
 
 
   Merged build finished. Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on issue #25372: [SPARK-28640][SQL] Only give warning when session catalog is not defined

2019-08-06 Thread GitBox

AmplabJenkins commented on issue #25372: [SPARK-28640][SQL] Only give warning 
when session catalog is not defined
URL: https://github.com/apache/spark/pull/25372#issuecomment-518875170
 
 
   Can one of the admins verify this patch?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] SparkQA commented on issue #25340: [SPARK-28393][SQL][PYTHON][TESTS] Convert and port 'pgSQL/join.sql' into UDF test base

2019-08-06 Thread GitBox

SparkQA commented on issue #25340: [SPARK-28393][SQL][PYTHON][TESTS] Convert 
and port 'pgSQL/join.sql' into UDF test base
URL: https://github.com/apache/spark/pull/25340#issuecomment-518881881
 
 
   **[Test build #108731 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/108731/testReport)**
 for PR 25340 at commit 
[`fd9a8fb`](https://github.com/apache/spark/commit/fd9a8fb3543475558d9e80e8791baa6c908e15b8).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on issue #25340: [SPARK-28393][SQL][PYTHON][TESTS] Convert and port 'pgSQL/join.sql' into UDF test base

2019-08-06 Thread GitBox

SparkQA removed a comment on issue #25340: [SPARK-28393][SQL][PYTHON][TESTS] 
Convert and port 'pgSQL/join.sql' into UDF test base
URL: https://github.com/apache/spark/pull/25340#issuecomment-518827893
 
 
   **[Test build #108731 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/108731/testReport)**
 for PR 25340 at commit 
[`fd9a8fb`](https://github.com/apache/spark/commit/fd9a8fb3543475558d9e80e8791baa6c908e15b8).


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] maropu commented on a change in pull request #25331: [SPARK-27768][SQL] Infinity, -Infinity, NaN should be recognized in a case insensitive manner.

2019-08-06 Thread GitBox

maropu commented on a change in pull request #25331: [SPARK-27768][SQL] 
Infinity, -Infinity, NaN should be recognized in a case insensitive manner.
URL: https://github.com/apache/spark/pull/25331#discussion_r311317299
 
 

 ##
 File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala
 ##
 @@ -562,8 +593,12 @@ case class Cast(child: Expression, dataType: DataType, 
timeZoneId: Option[String
   // FloatConverter
   private[this] def castToFloat(from: DataType): Any => Any = from match {
 case StringType =>
-  buildCast[UTF8String](_, s => try s.toString.toFloat catch {
-case _: NumberFormatException => null
+  buildCast[UTF8String](_, s => {
+val floatStr = s.toString
+try floatStr.toFloat catch {
+  case _: NumberFormatException =>
 
 Review comment:
   Could we check the numbers by a simple query?, e.g.,
   ```
   scala> spark.range(10).selectExpr("CAST(double(id) AS STRING) 
a").write.save("/tmp/test")
   scala> spark.read.load("/tmp/test").selectExpr("CAST(a AS 
DOUBLE)").write.format("noop").save()
   ```
   In another pr, I observed that a logic depending on exceptions cause high 
performance penalties: https://github.com/lz4/lz4-java/pull/143
   So, I'm a bit worried that this current logic has the same issue.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on issue #25340: [SPARK-28393][SQL][PYTHON][TESTS] Convert and port 'pgSQL/join.sql' into UDF test base

2019-08-06 Thread GitBox

AmplabJenkins commented on issue #25340: [SPARK-28393][SQL][PYTHON][TESTS] 
Convert and port 'pgSQL/join.sql' into UDF test base
URL: https://github.com/apache/spark/pull/25340#issuecomment-518882255
 
 
   Merged build finished. Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on issue #25340: [SPARK-28393][SQL][PYTHON][TESTS] Convert and port 'pgSQL/join.sql' into UDF test base

2019-08-06 Thread GitBox

AmplabJenkins commented on issue #25340: [SPARK-28393][SQL][PYTHON][TESTS] 
Convert and port 'pgSQL/join.sql' into UDF test base
URL: https://github.com/apache/spark/pull/25340#issuecomment-518882256
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/108731/
   Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on issue #25340: [SPARK-28393][SQL][PYTHON][TESTS] Convert and port 'pgSQL/join.sql' into UDF test base

2019-08-06 Thread GitBox

AmplabJenkins removed a comment on issue #25340: 
[SPARK-28393][SQL][PYTHON][TESTS] Convert and port 'pgSQL/join.sql' into UDF 
test base
URL: https://github.com/apache/spark/pull/25340#issuecomment-518882256
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/108731/
   Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on issue #25340: [SPARK-28393][SQL][PYTHON][TESTS] Convert and port 'pgSQL/join.sql' into UDF test base

2019-08-06 Thread GitBox

AmplabJenkins removed a comment on issue #25340: 
[SPARK-28393][SQL][PYTHON][TESTS] Convert and port 'pgSQL/join.sql' into UDF 
test base
URL: https://github.com/apache/spark/pull/25340#issuecomment-518882255
 
 
   Merged build finished. Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] maropu commented on a change in pull request #25344: [WIP][SPARK-28151][SQL] Mapped ByteType to TinyINT for MsSQLServerDialect

2019-08-06 Thread GitBox

maropu commented on a change in pull request #25344: [WIP][SPARK-28151][SQL] 
Mapped ByteType to TinyINT for MsSQLServerDialect
URL: https://github.com/apache/spark/pull/25344#discussion_r311319686
 
 

 ##
 File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala
 ##
 @@ -550,7 +550,7 @@ object JdbcUtils extends Logging {
 
 case ByteType =>
   (stmt: PreparedStatement, row: Row, pos: Int) =>
-stmt.setInt(pos + 1, row.getByte(pos))
+stmt.setByte(pos + 1, row.getByte(pos))
 
 Review comment:
   we don't have any test for this code path now?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] nvander1 edited a comment on issue #24232: [SPARK-27297] [SQL] Add higher order functions to scala API

2019-08-06 Thread GitBox

nvander1 edited a comment on issue #24232: [SPARK-27297] [SQL] Add higher order 
functions to scala API
URL: https://github.com/apache/spark/pull/24232#issuecomment-517880269
 
 
   @HyukjinKwon I think it's all good now. Added forall from  
https://github.com/apache/spark/pull/24761 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on issue #24232: [SPARK-27297] [SQL] Add higher order functions to scala API

2019-08-06 Thread GitBox

AmplabJenkins removed a comment on issue #24232: [SPARK-27297] [SQL] Add higher 
order functions to scala API
URL: https://github.com/apache/spark/pull/24232#issuecomment-518884999
 
 
   Merged build finished. Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] viirya commented on issue #25215: [SPARK-28445][SQL][PYTHON] Fix error when PythonUDF is used in both group by and aggregate expression

2019-08-06 Thread GitBox

viirya commented on issue #25215: [SPARK-28445][SQL][PYTHON] Fix error when 
PythonUDF is used in both group by and aggregate expression
URL: https://github.com/apache/spark/pull/25215#issuecomment-518884995
 
 
   @shivusondur @HyukjinKwon The analysis exception by adding udf to group by, 
is caused by SPARK-28386, SPARK-26741.
   
   ```
   == Analyzed Logical Plan ==
   org.apache.spark.sql.AnalysisException: cannot resolve '`b`' given input 
columns: [CAST(udf(cast(b as string)) AS INT), CAST(udf(cast(c as string)) AS 
STRING)]; line 2 pos 63;
   'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true
   +- Project [CAST(udf(cast(b as string)) AS INT)#x, CAST(udf(cast(c as 
string)) AS STRING)#x]
  +- Filter (cast(udf(cast(count(1)#xL as string)) as bigint) = cast(1 as 
bigint))
 +- Aggregate [cast(udf(cast(b#x as string)) as int), cast(udf(cast(c#x 
as string)) as string)], [cast(udf(cast(b#x as string)) as int) AS 
CAST(udf(cast(b as string)) AS INT)#x, cast(udf(cast(c#x as string)) as string) 
AS CAST(udf(cast(c as string)) AS STRING)#x, count(1) AS count(1)#xL]
+- SubqueryAlias `default`.`test_having`
   +- Relation[a#x,b#x,c#x,d#x] parquet
   ```


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on issue #24232: [SPARK-27297] [SQL] Add higher order functions to scala API

2019-08-06 Thread GitBox

AmplabJenkins commented on issue #24232: [SPARK-27297] [SQL] Add higher order 
functions to scala API
URL: https://github.com/apache/spark/pull/24232#issuecomment-518884999
 
 
   Merged build finished. Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on issue #24232: [SPARK-27297] [SQL] Add higher order functions to scala API

2019-08-06 Thread GitBox

AmplabJenkins commented on issue #24232: [SPARK-27297] [SQL] Add higher order 
functions to scala API
URL: https://github.com/apache/spark/pull/24232#issuecomment-518885010
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/13816/
   Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on issue #24232: [SPARK-27297] [SQL] Add higher order functions to scala API

2019-08-06 Thread GitBox

AmplabJenkins removed a comment on issue #24232: [SPARK-27297] [SQL] Add higher 
order functions to scala API
URL: https://github.com/apache/spark/pull/24232#issuecomment-518885010
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/13816/
   Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] SparkQA commented on issue #24232: [SPARK-27297] [SQL] Add higher order functions to scala API

2019-08-06 Thread GitBox

SparkQA commented on issue #24232: [SPARK-27297] [SQL] Add higher order 
functions to scala API
URL: https://github.com/apache/spark/pull/24232#issuecomment-518885405
 
 
   **[Test build #108734 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/108734/testReport)**
 for PR 24232 at commit 
[`a8c7ecd`](https://github.com/apache/spark/commit/a8c7ecd27b8d0fcabfd86571eeba801bb5c7e62a).


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] maropu commented on issue #25366: [SPARK-27924][SQL][TEST][FOLLOW-UP] Open comment about boolean test

2019-08-06 Thread GitBox

maropu commented on issue #25366: [SPARK-27924][SQL][TEST][FOLLOW-UP] Open 
comment about boolean test
URL: https://github.com/apache/spark/pull/25366#issuecomment-518885803
 
 
   Does https://github.com/apache/spark/pull/25357 afffects the output in this 
pr? If so, I think we'd be better to merge that pr first to make code changes 
less.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] dilipbiswal commented on a change in pull request #25331: [SPARK-27768][SQL] Infinity, -Infinity, NaN should be recognized in a case insensitive manner.

2019-08-06 Thread GitBox

dilipbiswal commented on a change in pull request #25331: [SPARK-27768][SQL] 
Infinity, -Infinity, NaN should be recognized in a case insensitive manner.
URL: https://github.com/apache/spark/pull/25331#discussion_r311322905
 
 

 ##
 File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala
 ##
 @@ -562,8 +593,12 @@ case class Cast(child: Expression, dataType: DataType, 
timeZoneId: Option[String
   // FloatConverter
   private[this] def castToFloat(from: DataType): Any => Any = from match {
 case StringType =>
-  buildCast[UTF8String](_, s => try s.toString.toFloat catch {
-case _: NumberFormatException => null
+  buildCast[UTF8String](_, s => {
+val floatStr = s.toString
+try floatStr.toFloat catch {
+  case _: NumberFormatException =>
 
 Review comment:
   @maropu Ok.. i will give it a try. One thing to note here is that, we didn't 
introduce a try/catch in this PR. That was existing before. We just added some 
extra code in the catch block. However, i will give it a try and get back.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] viirya commented on a change in pull request #23701: [SPARK-26741][SQL] Allow using aggregate expressions in ORDER BY clause

2019-08-06 Thread GitBox

viirya commented on a change in pull request #23701: [SPARK-26741][SQL] Allow 
using aggregate expressions in ORDER BY clause
URL: https://github.com/apache/spark/pull/23701#discussion_r311321824
 
 

 ##
 File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ##
 @@ -1722,22 +1723,28 @@ class Analyzer(
   case ae: AnalysisException => f
 }
 
-  case sort @ Sort(sortOrder, global, aggregate: Aggregate) if 
aggregate.resolved =>
-
+  case sort @ Sort(sortOrder, global, child)
+  if child.resolved && relatedAggregate(child).isDefined =>
+// The Aggregate plan may be not a direct child of sort when there is 
a HAVING clause.
+// In that case a `Filter` and/or a `Project` can be present between 
the `Sort` and the
+// related `Aggregate`.
+val aggregate = relatedAggregate(child).get
 // Try resolving the ordering as though it is in the aggregate clause.
 try {
   // If a sort order is unresolved, containing references not in 
aggregate, or containing
   // `AggregateExpression`, we need to push down it to the underlying 
aggregate operator.
   val unresolvedSortOrders = sortOrder.filter { s =>
-!s.resolved || !s.references.subsetOf(aggregate.outputSet) || 
containsAggregate(s)
+!s.resolved || !s.references.subsetOf(child.outputSet) || 
containsAggregate(s)
   }
-  val aliasedOrdering =
-unresolvedSortOrders.map(o => Alias(o.child, "aggOrder")())
-  val aggregatedOrdering = aggregate.copy(aggregateExpressions = 
aliasedOrdering)
+  val namedExpressionsOrdering =
+unresolvedSortOrders.map(_.child match {
+  case a: Attribute => a
 
 Review comment:
   This `Attribute` case is new. Previously we just alias aggregate expression, 
why need for attribute here?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] yeshengm commented on a change in pull request #24983: [SPARK-27714][SQL][CBO] Support Genetic Algorithm based join reorder

2019-08-06 Thread GitBox

yeshengm commented on a change in pull request #24983: [SPARK-27714][SQL][CBO] 
Support Genetic Algorithm based join reorder
URL: https://github.com/apache/spark/pull/24983#discussion_r311324313
 
 

 ##
 File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala
 ##
 @@ -470,3 +397,427 @@ object JoinReorderDPFilters extends PredicateHelper {
  * extended with the set of connected/unconnected plans.
  */
 case class JoinGraphInfo (starJoins: Set[Int], nonStarJoins: Set[Int])
+
+/**
+ * Reorder the joins using a genetic algorithm. The algorithm treat the 
reorder problem
+ * to a traveling salesmen problem, and use genetic algorithm give an 
optimized solution.
+ *
+ * The implementation refs the geqo in postgresql, which is contibuted by 
Darrell Whitley:
+ * https://www.postgresql.org/docs/9.1/geqo-pg-intro.html
+ *
+ * For more info about genetic algorithm and the edge recombination crossover, 
pls see:
+ * "A Genetic Algorithm Tutorial, Darrell Whitley"
+ * https://link.springer.com/article/10.1007/BF00175354
+ * and "Scheduling Problems and Traveling Salesmen: The Genetic Edge 
Recombination Operator,
+ * Darrell Whitley et al." https://dl.acm.org/citation.cfm?id=657238
+ * respectively.
+ */
+object JoinReorderGA extends PredicateHelper with Logging {
+
+  def search(
+  conf: SQLConf,
+  items: Seq[LogicalPlan],
+  conditions: Set[Expression],
+  output: Seq[Attribute]): Option[LogicalPlan] = {
+
+val startTime = System.nanoTime()
+
+val itemsWithIndex = items.zipWithIndex.map {
+  case (plan, id) => id -> JoinPlan(Set(id), plan, Set.empty, Cost(0, 0))
+}.toMap
+
+val topOutputSet = AttributeSet(output)
+
+val pop = Population(conf, itemsWithIndex, conditions, topOutputSet).evolve
+
+val durationInMs = (System.nanoTime() - startTime) / (1000 * 1000)
+logInfo(s"Join reordering finished. Duration: $durationInMs ms, number of 
items: " +
+s"${items.length}, number of plans in memo: ${ pop.chromos.size}")
+
+assert(pop.chromos.head.basicPlans.size == items.length)
+pop.chromos.head.integratedPlan match {
+  case Some(joinPlan) => joinPlan.plan match {
+case p @ Project(projectList, _: Join) if projectList != output =>
+  assert(topOutputSet == p.outputSet)
+  // Keep the same order of final output attributes.
+  Some(p.copy(projectList = output))
+case finalPlan if !sameOutput(finalPlan, output) =>
+  Some(Project(output, finalPlan))
+case finalPlan =>
+  Some(finalPlan)
+  }
+  case _ => None
+}
+  }
+}
+
+/**
+ * A pair of parent individuals can breed a child with certain crossover 
process.
+ * With crossover, child can inherit gene from its parents, and these gene 
snippets
+ * finally compose a new [[Chromosome]].
+ */
+@DeveloperApi
+trait Crossover {
+
+  /**
+   * Generate a new [[Chromosome]] from the given parent [[Chromosome]]s,
+   * with this crossover algorithm.
+   */
+  def newChromo(father: Chromosome, mother: Chromosome) : Chromosome
+}
+
+case class EdgeTable(table: Map[JoinPlan, Seq[JoinPlan]])
+
+/**
+ * This class implements the Genetic Edge Recombination algorithm.
+ * For more information about the Genetic Edge Recombination,
+ * see "Scheduling Problems and Traveling Salesmen: The Genetic Edge
+ * Recombination Operator" by Darrell Whitley et al.
+ * https://dl.acm.org/citation.cfm?id=657238
+ */
+object EdgeRecombination extends Crossover {
 
 Review comment:
   Please give a one-or-two-sentence definition of EdgeRecombination? Also give 
a simple example here about how an edge map is constructed and a new path is 
generated from two paths?
   
   I feel like the example in this paper "The Traveling Salesman and Sequence 
Scheduling: Quality Solutions Using Genetic Edge Recombination" is fairly 
straightforward.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] maropu commented on issue #25347: [SPARK-28610][SQL] Allow having a decimal buffer for long sum

2019-08-06 Thread GitBox

maropu commented on issue #25347: [SPARK-28610][SQL] Allow having a decimal 
buffer for long sum
URL: https://github.com/apache/spark/pull/25347#issuecomment-518889291
 
 
   Yea, I think so. If we have a new flag, IMO it'd be better that spark has 
the same behivour with the others.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] jzhuge commented on a change in pull request #25330: [SPARK-28565][SQL] DataFrameWriter saveAsTable support for V2 catalogs

2019-08-06 Thread GitBox

jzhuge commented on a change in pull request #25330: [SPARK-28565][SQL] 
DataFrameWriter saveAsTable support for V2 catalogs
URL: https://github.com/apache/spark/pull/25330#discussion_r311311567
 
 

 ##
 File path: 
sql/core/src/test/scala/org/apache/spark/sql/sources/v2/DataSourceV2DataFrameSuite.scala
 ##
 @@ -141,4 +141,66 @@ class DataSourceV2DataFrameSuite extends QueryTest with 
SharedSQLContext with Be
   }
 }
   }
+
+  testQuietly("saveAsTable: with defined catalog and table doesn't exist") {
 
 Review comment:
   This and many other tests don't need `with defined catalog` in test name any 
more since v2 session tests are split into a separate file.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] jzhuge commented on a change in pull request #25330: [SPARK-28565][SQL] DataFrameWriter saveAsTable support for V2 catalogs

2019-08-06 Thread GitBox

jzhuge commented on a change in pull request #25330: [SPARK-28565][SQL] 
DataFrameWriter saveAsTable support for V2 catalogs
URL: https://github.com/apache/spark/pull/25330#discussion_r311311184
 
 

 ##
 File path: 
sql/core/src/test/scala/org/apache/spark/sql/sources/v2/DataSourceV2DataFrameSuite.scala
 ##
 @@ -19,15 +19,15 @@ package org.apache.spark.sql.sources.v2
 
 import org.scalatest.BeforeAndAfter
 
-import org.apache.spark.sql.QueryTest
+import org.apache.spark.sql.{QueryTest, Row}
 import org.apache.spark.sql.internal.SQLConf.{PARTITION_OVERWRITE_MODE, 
PartitionOverwriteMode}
 import org.apache.spark.sql.test.SharedSQLContext
 
 class DataSourceV2DataFrameSuite extends QueryTest with SharedSQLContext with 
BeforeAndAfter {
   import testImplicits._
 
   before {
-spark.conf.set("spark.sql.catalog.testcat", 
classOf[TestInMemoryTableCatalog].getName)
+spark.conf.set(s"spark.sql.catalog.testcat", 
classOf[TestInMemoryTableCatalog].getName)
 
 Review comment:
   Needed?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] jzhuge commented on a change in pull request #25330: [SPARK-28565][SQL] DataFrameWriter saveAsTable support for V2 catalogs

2019-08-06 Thread GitBox

jzhuge commented on a change in pull request #25330: [SPARK-28565][SQL] 
DataFrameWriter saveAsTable support for V2 catalogs
URL: https://github.com/apache/spark/pull/25330#discussion_r311324593
 
 

 ##
 File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ##
 @@ -648,8 +648,11 @@ class Analyzer(
   if catalog.isTemporaryTable(ident) =>
 u // temporary views take precedence over catalog table names
 
-  case u @ UnresolvedRelation(CatalogObjectIdentifier(Some(catalogPlugin), 
ident)) =>
-loadTable(catalogPlugin, 
ident).map(DataSourceV2Relation.create).getOrElse(u)
+  case u @ UnresolvedRelation(CatalogObjectIdentifier(maybeCatalog, 
ident)) =>
+maybeCatalog.orElse(sessionCatalog)
 
 Review comment:
   This changes the SELECT path, so more unit tests for table relation are 
needed, e.g.:
   ```
 test("Relation: basic") {
   val t1 = "tbl"
   withTable(t1) {
 sql(s"CREATE TABLE $t1 USING $v2Format AS SELECT id, data FROM source")
 checkAnswer(spark.table(t1), spark.table("source"))
 checkAnswer(sql(s"TABLE $t1"), spark.table("source"))
 checkAnswer(sql(s"SELECT * FROM $t1"), spark.table("source"))
   }
 }
   ```
   
   IMHO this change v2 session catalog and related tests in 
`DataSourceV2SessionDataFrameSuite` can be split to a separate PR with 
additional unit tests for SQL SELECT and spark.table().
   
   Another PR to update `ResolveInsertInto` to support v2 session catalog in 
SQL INSERT INTO and DFW.insertInto.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] jzhuge commented on a change in pull request #25330: [SPARK-28565][SQL] DataFrameWriter saveAsTable support for V2 catalogs

2019-08-06 Thread GitBox

jzhuge commented on a change in pull request #25330: [SPARK-28565][SQL] 
DataFrameWriter saveAsTable support for V2 catalogs
URL: https://github.com/apache/spark/pull/25330#discussion_r311310995
 
 

 ##
 File path: 
sql/core/src/test/scala/org/apache/spark/sql/sources/v2/DataSourceDFWriterV2SessionCatalogSuite.scala
 ##
 @@ -0,0 +1,146 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.sources.v2
+
+import java.util
+import java.util.concurrent.ConcurrentHashMap
+
+import scala.collection.JavaConverters._
+
+import org.scalatest.BeforeAndAfter
+
+import org.apache.spark.sql.{QueryTest, Row}
+import org.apache.spark.sql.catalog.v2.Identifier
+import org.apache.spark.sql.catalog.v2.expressions.Transform
+import org.apache.spark.sql.catalyst.analysis.TableAlreadyExistsException
+import org.apache.spark.sql.execution.datasources.v2.V2SessionCatalog
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.test.SharedSQLContext
+import org.apache.spark.sql.types.StructType
+import org.apache.spark.sql.util.CaseInsensitiveStringMap
+
+class DataSourceDFWriterV2SessionCatalogSuite
 
 Review comment:
   How about `DataSourceV2SessionDataFrameSuite`? Match the name 
`DataSourceV2DataFrameSuite` better.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] maropu commented on a change in pull request #25331: [SPARK-27768][SQL] Infinity, -Infinity, NaN should be recognized in a case insensitive manner.

2019-08-06 Thread GitBox

maropu commented on a change in pull request #25331: [SPARK-27768][SQL] 
Infinity, -Infinity, NaN should be recognized in a case insensitive manner.
URL: https://github.com/apache/spark/pull/25331#discussion_r311325089
 
 

 ##
 File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala
 ##
 @@ -562,8 +593,12 @@ case class Cast(child: Expression, dataType: DataType, 
timeZoneId: Option[String
   // FloatConverter
   private[this] def castToFloat(from: DataType): Any => Any = from match {
 case StringType =>
-  buildCast[UTF8String](_, s => try s.toString.toFloat catch {
-case _: NumberFormatException => null
+  buildCast[UTF8String](_, s => {
+val floatStr = s.toString
+try floatStr.toFloat catch {
+  case _: NumberFormatException =>
 
 Review comment:
   Yea, I mean the performance for known strings as @viirya said.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] jzhuge commented on a change in pull request #25330: [SPARK-28565][SQL] DataFrameWriter saveAsTable support for V2 catalogs

2019-08-06 Thread GitBox

jzhuge commented on a change in pull request #25330: [SPARK-28565][SQL] 
DataFrameWriter saveAsTable support for V2 catalogs
URL: https://github.com/apache/spark/pull/25330#discussion_r311324593
 
 

 ##
 File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ##
 @@ -648,8 +648,11 @@ class Analyzer(
   if catalog.isTemporaryTable(ident) =>
 u // temporary views take precedence over catalog table names
 
-  case u @ UnresolvedRelation(CatalogObjectIdentifier(Some(catalogPlugin), 
ident)) =>
-loadTable(catalogPlugin, 
ident).map(DataSourceV2Relation.create).getOrElse(u)
+  case u @ UnresolvedRelation(CatalogObjectIdentifier(maybeCatalog, 
ident)) =>
+maybeCatalog.orElse(sessionCatalog)
 
 Review comment:
   This changes the SELECT path, so more unit tests for table relation are 
needed, e.g.:
   ```
 test("Relation: basic") {
   val t1 = "tbl"
   withTable(t1) {
 sql(s"CREATE TABLE $t1 USING $v2Format AS SELECT id, data FROM source")
 checkAnswer(spark.table(t1), spark.table("source"))
 checkAnswer(sql(s"TABLE $t1"), spark.table("source"))
 checkAnswer(sql(s"SELECT * FROM $t1"), spark.table("source"))
   }
 }
   ```
   
   IMHO this v2 session catalog support and related tests can be split to a 
separate PR with additional unit tests for SQL SELECT and spark.table().
   
   Also another PR to update `ResolveInsertInto` to support v2 session catalog 
in SQL INSERT INTO and DFW.insertInto.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on a change in pull request #25360: [SPARK-28280][PYTHON][SQL][TESTS][FOLLOW-UP] Add UDF cases into group by clause in 'udf-group-by.sql'

2019-08-06 Thread GitBox

HyukjinKwon commented on a change in pull request #25360: 
[SPARK-28280][PYTHON][SQL][TESTS][FOLLOW-UP] Add UDF cases into group by clause 
in 'udf-group-by.sql'
URL: https://github.com/apache/spark/pull/25360#discussion_r311327632
 
 

 ##
 File path: sql/core/src/test/resources/sql-tests/inputs/udf/udf-group-by.sql
 ##
 @@ -20,29 +20,25 @@ SELECT 'foo', COUNT(udf(a)) FROM testData GROUP BY 1;
 SELECT 'foo' FROM testData WHERE a = 0 GROUP BY udf(1);
 
 -- Aggregate grouped by literals (hash aggregate).
-SELECT 'foo', udf(APPROX_COUNT_DISTINCT(udf(a))) FROM testData WHERE a = 0 
GROUP BY 1;
+SELECT 'foo', udf(APPROX_COUNT_DISTINCT(udf(a))) FROM testData WHERE a = 0 
GROUP BY udf(1);
 
 -- Aggregate grouped by literals (sort aggregate).
-SELECT 'foo', MAX(STRUCT(udf(a))) FROM testData WHERE a = 0 GROUP BY 1;
+SELECT 'foo', MAX(STRUCT(udf(a))) FROM testData WHERE a = 0 GROUP BY udf(1);
 
 -- Aggregate with complex GroupBy expressions.
 SELECT udf(a + b), udf(COUNT(b)) FROM testData GROUP BY a + b;
 SELECT udf(a + 2), udf(COUNT(b)) FROM testData GROUP BY a + 1;
-
--- [SPARK-28445] Inconsistency between Scala and Python/Panda udfs when 
groupby with udf() is used
--- The following query will make Scala UDF work, but Python and Pandas udfs 
will fail with an AnalysisException.
--- The query should be added after SPARK-28445.
--- SELECT udf(a + 1), udf(COUNT(b)) FROM testData GROUP BY udf(a + 1);
+SELECT udf(a + 1), udf(COUNT(b)) FROM testData GROUP BY udf(a + 1);
 
 Review comment:
   Yes, I missed this diff. Could we match?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on a change in pull request #25310: [SPARK-28578][INFRA] Improve Github pull request template

2019-08-06 Thread GitBox

HyukjinKwon commented on a change in pull request #25310: [SPARK-28578][INFRA] 
Improve Github pull request template
URL: https://github.com/apache/spark/pull/25310#discussion_r311328281
 
 

 ##
 File path: .github/PULL_REQUEST_TEMPLATE
 ##
 @@ -1,10 +1,40 @@
-## What changes were proposed in this pull request?
+
 
-(Please fill in changes proposed in this fix)
+### What changes were proposed in this pull request?
+
 
-## How was this patch tested?
 
-(Please explain how this patch was tested. E.g. unit tests, integration tests, 
manual tests)
-(If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this)
+### Why are the changes needed?
+
 
-Please review https://spark.apache.org/contributing.html before opening a pull 
request.
+### Does this PR introduce any user-facing change?
 
 Review comment:
   Basically yea. For instance, if we meet a breaking change after upgrading to 
higher Spark, time to time we should look through the commit history. I think 
it's better to clarify it if that can be identified early.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] SparkQA commented on issue #22282: [SPARK-23539][SS] Add support for Kafka headers in Structured Streaming

2019-08-06 Thread GitBox

SparkQA commented on issue #22282: [SPARK-23539][SS] Add support for Kafka 
headers in Structured Streaming
URL: https://github.com/apache/spark/pull/22282#issuecomment-518894131
 
 
   **[Test build #108735 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/108735/testReport)**
 for PR 22282 at commit 
[`23dc81c`](https://github.com/apache/spark/commit/23dc81c7f5d479d92746bf19aeff9f69d6a32424).


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] zhengruifeng commented on a change in pull request #25324: [SPARK-21481][ML][FOLLOWUP] HashingTF Cleanup and Tiny Optimizations

2019-08-06 Thread GitBox

zhengruifeng commented on a change in pull request #25324: 
[SPARK-21481][ML][FOLLOWUP] HashingTF Cleanup and Tiny Optimizations
URL: https://github.com/apache/spark/pull/25324#discussion_r311330089
 
 

 ##
 File path: mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala
 ##
 @@ -100,19 +100,26 @@ class HashingTF @Since("1.4.0") (@Since("1.4.0") 
override val uid: String)
   @Since("2.0.0")
   override def transform(dataset: Dataset[_]): DataFrame = {
 val outputSchema = transformSchema(dataset.schema)
-val hashUDF = udf { terms: Seq[_] =>
-  val numOfFeatures = $(numFeatures)
-  val isBinary = $(binary)
-  val termFrequencies = mutable.HashMap.empty[Int, 
Double].withDefaultValue(0.0)
-  terms.foreach { term =>
-val i = indexOf(term)
-if (isBinary) {
+val localNumFeatures = $(numFeatures)
+
+val hashUDF = if ($(binary)) {
 
 Review comment:
   ok, I misunderstood the comments


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] gatorsmile commented on a change in pull request #25328: [SPARK-28595][SQL] explain should not trigger partition listing

2019-08-06 Thread GitBox

gatorsmile commented on a change in pull request #25328: [SPARK-28595][SQL] 
explain should not trigger partition listing
URL: https://github.com/apache/spark/pull/25328#discussion_r311330990
 
 

 ##
 File path: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveExplainSuite.scala
 ##
 @@ -207,4 +206,14 @@ class HiveExplainSuite extends QueryTest with 
SQLTestUtils with TestHiveSingleto
   }
 }
   }
+
+  test("SPARK-28595: explain should not trigger partition listing") {
+HiveCatalogMetrics.reset()
+withTable("t") {
+  sql("CREATE TABLE t USING json PARTITIONED BY (j) AS SELECT 1 i, 2 j")
+  assert(HiveCatalogMetrics.METRIC_PARTITIONS_FETCHED.getCount == 0)
+  spark.table("t").explain()
+  assert(HiveCatalogMetrics.METRIC_PARTITIONS_FETCHED.getCount == 0)
 
 Review comment:
   Add a test case that can return non-zero when 
`spark.sql.legacy.bucketedTableScan.outputOrdering` is set true? 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] gatorsmile commented on a change in pull request #25328: [SPARK-28595][SQL] explain should not trigger partition listing

2019-08-06 Thread GitBox

gatorsmile commented on a change in pull request #25328: [SPARK-28595][SQL] 
explain should not trigger partition listing
URL: https://github.com/apache/spark/pull/25328#discussion_r311330990
 
 

 ##
 File path: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveExplainSuite.scala
 ##
 @@ -207,4 +206,14 @@ class HiveExplainSuite extends QueryTest with 
SQLTestUtils with TestHiveSingleto
   }
 }
   }
+
+  test("SPARK-28595: explain should not trigger partition listing") {
+HiveCatalogMetrics.reset()
+withTable("t") {
+  sql("CREATE TABLE t USING json PARTITIONED BY (j) AS SELECT 1 i, 2 j")
+  assert(HiveCatalogMetrics.METRIC_PARTITIONS_FETCHED.getCount == 0)
+  spark.table("t").explain()
+  assert(HiveCatalogMetrics.METRIC_PARTITIONS_FETCHED.getCount == 0)
 
 Review comment:
   Add a test case that can return non-zero when 
`spark.sql.legacy.bucketedTableScan.outputOrdering` is set to true? 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] gatorsmile commented on issue #25328: [SPARK-28595][SQL] explain should not trigger partition listing

2019-08-06 Thread GitBox

gatorsmile commented on issue #25328: [SPARK-28595][SQL] explain should not 
trigger partition listing
URL: https://github.com/apache/spark/pull/25328#issuecomment-518896033
 
 
   LGTM except one comment.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on issue #25324: [SPARK-21481][ML][FOLLOWUP] HashingTF Cleanup and Tiny Optimizations

2019-08-06 Thread GitBox

AmplabJenkins commented on issue #25324: [SPARK-21481][ML][FOLLOWUP] HashingTF 
Cleanup and Tiny Optimizations
URL: https://github.com/apache/spark/pull/25324#issuecomment-518896481
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/13817/
   Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on issue #25324: [SPARK-21481][ML][FOLLOWUP] HashingTF Cleanup and Tiny Optimizations

2019-08-06 Thread GitBox

AmplabJenkins commented on issue #25324: [SPARK-21481][ML][FOLLOWUP] HashingTF 
Cleanup and Tiny Optimizations
URL: https://github.com/apache/spark/pull/25324#issuecomment-518896478
 
 
   Merged build finished. Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on issue #25324: [SPARK-21481][ML][FOLLOWUP] HashingTF Cleanup and Tiny Optimizations

2019-08-06 Thread GitBox

AmplabJenkins removed a comment on issue #25324: [SPARK-21481][ML][FOLLOWUP] 
HashingTF Cleanup and Tiny Optimizations
URL: https://github.com/apache/spark/pull/25324#issuecomment-518896478
 
 
   Merged build finished. Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on issue #25324: [SPARK-21481][ML][FOLLOWUP] HashingTF Cleanup and Tiny Optimizations

2019-08-06 Thread GitBox

AmplabJenkins removed a comment on issue #25324: [SPARK-21481][ML][FOLLOWUP] 
HashingTF Cleanup and Tiny Optimizations
URL: https://github.com/apache/spark/pull/25324#issuecomment-518896481
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/13817/
   Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] SparkQA commented on issue #25324: [SPARK-21481][ML][FOLLOWUP] HashingTF Cleanup and Tiny Optimizations

2019-08-06 Thread GitBox

SparkQA commented on issue #25324: [SPARK-21481][ML][FOLLOWUP] HashingTF 
Cleanup and Tiny Optimizations
URL: https://github.com/apache/spark/pull/25324#issuecomment-518896810
 
 
   **[Test build #108736 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/108736/testReport)**
 for PR 25324 at commit 
[`0d5ca8c`](https://github.com/apache/spark/commit/0d5ca8cbb992c21d16465088ae8b2bb8bd5cb8f4).


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] zhengruifeng commented on a change in pull request #25349: [SPARK-28538][UI][WIP] Document SQL page

2019-08-06 Thread GitBox

zhengruifeng commented on a change in pull request #25349: 
[SPARK-28538][UI][WIP] Document SQL page
URL: https://github.com/apache/spark/pull/25349#discussion_r311332004
 
 

 ##
 File path: docs/monitoring.md
 ##
 @@ -40,6 +40,100 @@ To view the web UI after the fact, set 
`spark.eventLog.enabled` to true before s
 application. This configures Spark to log Spark events that encode the 
information displayed
 in the UI to persisted storage.
 
+## Web UI Tabs
+The web UI provides an overview of the Spark cluster and is composed of 
following tabs:
+
+### Jobs Tab
+The Jobs tab displays a summary page of all jobs in the Spark application and 
a detailed page
+for each job. The summary page shows high-level information, such as the 
status, duration, and
+progress of all jobs and the overall event timeline. When you click on a job 
on the summary
+page, you see the detailed page for that job. The detailed page further shows 
the event timeline,
+DAG visualization, and all stages of the job.
+
+### Stages Tab
+The Stages tab displays a summary page that shows the current state of all 
stages of all jobs in
+the Spark application, and, when you click on a stage, a detailed page for 
that stage. The details
+page shows the event timeline, DAG visualization, and all tasks for the stage.
+
+### Storage Tab
+The Storage tab displays the persisted RDDs, if any, in the application. The 
summary page shows
+the storage levels, sizes and partitions of all RDDs, and the detailed page 
shows the sizes and
+using executors for all partitions in an RDD.
+
+### Environment Tab
+The Environment tab displays the values for the different environment and 
configuration variables,
+including JVM, Spark, and system properties.
+
+### Executors Tab
+The Executors tab displays summary information about the executors that were 
created for the
+application, including memory and disk usage and task and shuffle information. 
The Storage Memory
+column shows the amount of memory used and reserved for caching data.
+
+### SQL Tab
+If the application executes Spark SQL queries, the SQL tab displays 
information, such as the duration,
+jobs, and physical and logical plans for the queries. Here we include a basic 
example to illustrate
+this tab:
+{% highlight scala %}
+scala> val df = Seq((1, "andy"), (2, "bob"), (2, "andy")).toDF("count", "name")
+df: org.apache.spark.sql.DataFrame = [count: int, name: string]
+
+scala> df.count
+res0: Long = 3 
 
+
+scala> df.createGlobalTempView("df")
+
+scala> spark.sql("select name,sum(count) from global_temp.df group by 
name").show
+++--+
+|name|sum(count)|
+++--+
+|andy| 3|
+| bob| 2|
+++--+
+{% endhighlight %}
+
+
+  
+  
+
+
+Now the above three dataframe/SQL operators are shown in the list. If we click 
the
+'show at \: 24' link of the last query, we will see the DAG of the 
job.
+
+
+  
+  
+
+
+We can see that detailed information of each stage. The first block 
'WholeStageCodegen'  
+compile multiple operator ('LocalTableScan' and 'HashAggregate') together into 
a single Java
+function to improve performance, and metrics like number of rows and spill 
size are listed in
+the block. The second block 'Exchange' shows the metrics on the shuffle 
exchange, including
+number of written shuffle records, total data size, etc.
+
+
+
+

[GitHub] [spark] HyukjinKwon commented on issue #25215: [SPARK-28445][SQL][PYTHON] Fix error when PythonUDF is used in both group by and aggregate expression

2019-08-06 Thread GitBox

HyukjinKwon commented on issue #25215: [SPARK-28445][SQL][PYTHON] Fix error 
when PythonUDF is used in both group by and aggregate expression
URL: https://github.com/apache/spark/pull/25215#issuecomment-518897609
 
 
   Thanks, @viirya. @shivusondur Can you comment the test out with those JIRA 
numbers?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] SparkQA commented on issue #22282: [SPARK-23539][SS] Add support for Kafka headers in Structured Streaming

2019-08-06 Thread GitBox

SparkQA commented on issue #22282: [SPARK-23539][SS] Add support for Kafka 
headers in Structured Streaming
URL: https://github.com/apache/spark/pull/22282#issuecomment-518898872
 
 
   **[Test build #108735 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/108735/testReport)**
 for PR 22282 at commit 
[`23dc81c`](https://github.com/apache/spark/commit/23dc81c7f5d479d92746bf19aeff9f69d6a32424).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] maropu commented on a change in pull request #24881: [SPARK-23160][SQL][TEST] Port window.sql

2019-08-06 Thread GitBox

maropu commented on a change in pull request #24881: [SPARK-23160][SQL][TEST] 
Port window.sql
URL: https://github.com/apache/spark/pull/24881#discussion_r311333666
 
 

 ##
 File path: sql/core/src/test/resources/sql-tests/inputs/pgSQL/window.sql
 ##
 @@ -0,0 +1,1175 @@
+-- Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+--
+-- Window Functions Testing
+-- 
https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/sql/window.sql
+
+CREATE TEMPORARY VIEW tenk2 AS SELECT * FROM tenk1;
+
+CREATE TABLE empsalary (
+depname string,
+empno integer,
+salary int,
+enroll_date date
+) USING parquet;
+
+INSERT INTO empsalary VALUES
+('develop', 10, 5200, '2007-08-01'),
+('sales', 1, 5000, '2006-10-01'),
+('personnel', 5, 3500, '2007-12-10'),
+('sales', 4, 4800, '2007-08-08'),
+('personnel', 2, 3900, '2006-12-23'),
+('develop', 7, 4200, '2008-01-01'),
+('develop', 9, 4500, '2008-01-01'),
+('sales', 3, 4800, '2007-08-01'),
+('develop', 8, 6000, '2006-10-01'),
+('develop', 11, 5200, '2007-08-15');
+
+SELECT depname, empno, salary, sum(salary) OVER (PARTITION BY depname) FROM 
empsalary ORDER BY depname, salary;
+
+SELECT depname, empno, salary, rank() OVER (PARTITION BY depname ORDER BY 
salary) FROM empsalary;
+
+-- with GROUP BY
+SELECT four, ten, SUM(SUM(four)) OVER (PARTITION BY four), AVG(ten) FROM tenk1
+GROUP BY four, ten ORDER BY four, ten;
+
+SELECT depname, empno, salary, sum(salary) OVER w FROM empsalary WINDOW w AS 
(PARTITION BY depname);
+
+-- [SPARK-28064] Order by does not accept a call to rank()
+-- SELECT depname, empno, salary, rank() OVER w FROM empsalary WINDOW w AS 
(PARTITION BY depname ORDER BY salary) ORDER BY rank() OVER w;
+
+-- empty window specification
+SELECT COUNT(*) OVER () FROM tenk1 WHERE unique2 < 10;
+
+SELECT COUNT(*) OVER w FROM tenk1 WHERE unique2 < 10 WINDOW w AS ();
+
+-- no window operation
+SELECT four FROM tenk1 WHERE FALSE WINDOW w AS (PARTITION BY ten);
+
+-- cumulative aggregate
+SELECT sum(four) OVER (PARTITION BY ten ORDER BY unique2) AS sum_1, ten, four 
FROM tenk1 WHERE unique2 < 10;
+
+SELECT row_number() OVER (ORDER BY unique2) FROM tenk1 WHERE unique2 < 10;
+
+SELECT rank() OVER (PARTITION BY four ORDER BY ten) AS rank_1, ten, four FROM 
tenk1 WHERE unique2 < 10;
+
+SELECT dense_rank() OVER (PARTITION BY four ORDER BY ten), ten, four FROM 
tenk1 WHERE unique2 < 10;
+
+SELECT percent_rank() OVER (PARTITION BY four ORDER BY ten), ten, four FROM 
tenk1 WHERE unique2 < 10;
+
+SELECT cume_dist() OVER (PARTITION BY four ORDER BY ten), ten, four FROM tenk1 
WHERE unique2 < 10;
+
+SELECT ntile(3) OVER (ORDER BY ten, four), ten, four FROM tenk1 WHERE unique2 
< 10;
+
+-- [SPARK-28065] ntile does not accept NULL as input
+-- SELECT ntile(NULL) OVER (ORDER BY ten, four), ten, four FROM tenk1 LIMIT 2;
+
+SELECT lag(ten) OVER (PARTITION BY four ORDER BY ten), ten, four FROM tenk1 
WHERE unique2 < 10;
+
+-- [SPARK-28068] `lag` second argument must be a literal in Spark
+-- SELECT lag(ten, four) OVER (PARTITION BY four ORDER BY ten), ten, four FROM 
tenk1 WHERE unique2 < 10;
+-- SELECT lag(ten, four, 0) OVER (PARTITION BY four ORDER BY ten), ten, four 
FROM tenk1 WHERE unique2 < 10;
+
+SELECT lead(ten) OVER (PARTITION BY four ORDER BY ten), ten, four FROM tenk1 
WHERE unique2 < 10;
+
+SELECT lead(ten * 2, 1) OVER (PARTITION BY four ORDER BY ten), ten, four FROM 
tenk1 WHERE unique2 < 10;
+
+SELECT lead(ten * 2, 1, -1) OVER (PARTITION BY four ORDER BY ten), ten, four 
FROM tenk1 WHERE unique2 < 10;
+
+SELECT first(ten) OVER (PARTITION BY four ORDER BY ten), ten, four FROM tenk1 
WHERE unique2 < 10;
+
+-- last returns the last row of the frame, which is CURRENT ROW in ORDER BY 
window.
+SELECT last(four) OVER (ORDER BY ten), ten, four FROM tenk1 WHERE unique2 < 10;
+
+SELECT last(ten) OVER (PARTITION BY four), ten, four FROM
+(SELECT * FROM tenk1 WHERE unique2 < 10 ORDER BY four, ten)s
+ORDER BY four, ten;
+
+-- [SPARK-27951] ANSI SQL: NTH_VALUE function
+-- SELECT nth_value(ten, four + 1) OVER (PARTITION BY four), ten, four
+-- FROM (SELECT * FROM tenk1 WHERE unique2 < 10 ORDER BY four, ten)s;
+
+SELECT ten, two, sum(hundred) AS gsum, sum(sum(hundred)) OVER (PARTITION BY 
two ORDER BY ten) AS wsum
+FROM tenk1 GROUP BY ten, two;
+
+SELECT count(*) OVER (PARTITION BY four), four FROM (SELECT * FROM tenk1 WHERE 
two = 1)s WHERE unique2 < 10;
+
+SELECT (count(*) OVER (PARTITION BY four ORDER BY ten) +
+  sum(hundred) OVER (PARTITION BY four ORDER BY ten)) AS cntsum
+  FROM tenk1 WHERE unique2 < 10;
+
+-- opexpr with different windows evaluation.
+SELECT * FROM(
+  SELECT count(*) OVER (PARTITION BY four ORDER BY ten) +
+sum(hundred) OVER (PARTITION BY two ORDER BY ten) AS total,
+count(*) OVER (PARTITION BY four ORDER BY ten) AS fourcount,
+sum(hundred) OVER (PARTITION BY two ORDER BY ten) AS twosum
+FROM tenk1
+)sub WHERE total <> fourcount + twosum;
+
+SELECT avg(four) OVER (PARTITION BY four ORDER

[GitHub] [spark] SparkQA removed a comment on issue #22282: [SPARK-23539][SS] Add support for Kafka headers in Structured Streaming

2019-08-06 Thread GitBox

SparkQA removed a comment on issue #22282: [SPARK-23539][SS] Add support for 
Kafka headers in Structured Streaming
URL: https://github.com/apache/spark/pull/22282#issuecomment-518894131
 
 
   **[Test build #108735 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/108735/testReport)**
 for PR 22282 at commit 
[`23dc81c`](https://github.com/apache/spark/commit/23dc81c7f5d479d92746bf19aeff9f69d6a32424).


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on issue #22282: [SPARK-23539][SS] Add support for Kafka headers in Structured Streaming

2019-08-06 Thread GitBox

AmplabJenkins commented on issue #22282: [SPARK-23539][SS] Add support for 
Kafka headers in Structured Streaming
URL: https://github.com/apache/spark/pull/22282#issuecomment-518898951
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/108735/
   Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] SparkQA commented on issue #25371: [SPARK-28393][SQL][PYTHON][TESTS] Convert and port 'pgSQL/join.sql' into UDF test base

2019-08-06 Thread GitBox

SparkQA commented on issue #25371: [SPARK-28393][SQL][PYTHON][TESTS] Convert 
and port 'pgSQL/join.sql' into UDF test base
URL: https://github.com/apache/spark/pull/25371#issuecomment-518898909
 
 
   **[Test build #108732 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/108732/testReport)**
 for PR 25371 at commit 
[`b7aad47`](https://github.com/apache/spark/commit/b7aad47c0121501343729c7d73a75df74722267e).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on issue #22282: [SPARK-23539][SS] Add support for Kafka headers in Structured Streaming

2019-08-06 Thread GitBox

AmplabJenkins commented on issue #22282: [SPARK-23539][SS] Add support for 
Kafka headers in Structured Streaming
URL: https://github.com/apache/spark/pull/22282#issuecomment-518898945
 
 
   Merged build finished. Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on issue #22282: [SPARK-23539][SS] Add support for Kafka headers in Structured Streaming

2019-08-06 Thread GitBox

AmplabJenkins removed a comment on issue #22282: [SPARK-23539][SS] Add support 
for Kafka headers in Structured Streaming
URL: https://github.com/apache/spark/pull/22282#issuecomment-518898945
 
 
   Merged build finished. Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on issue #22282: [SPARK-23539][SS] Add support for Kafka headers in Structured Streaming

2019-08-06 Thread GitBox

AmplabJenkins removed a comment on issue #22282: [SPARK-23539][SS] Add support 
for Kafka headers in Structured Streaming
URL: https://github.com/apache/spark/pull/22282#issuecomment-518898951
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/108735/
   Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on issue #25371: [SPARK-28393][SQL][PYTHON][TESTS] Convert and port 'pgSQL/join.sql' into UDF test base

2019-08-06 Thread GitBox

SparkQA removed a comment on issue #25371: [SPARK-28393][SQL][PYTHON][TESTS] 
Convert and port 'pgSQL/join.sql' into UDF test base
URL: https://github.com/apache/spark/pull/25371#issuecomment-518856479
 
 
   **[Test build #108732 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/108732/testReport)**
 for PR 25371 at commit 
[`b7aad47`](https://github.com/apache/spark/commit/b7aad47c0121501343729c7d73a75df74722267e).


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

< 1 2 3 4 5 6 7 8 9 >

501 - 600 of 854 matches

Mail list logo