[GitHub] [spark] c21 commented on pull request #29473: [SPARK-32656][SQL] Repartition bucketed tables for sort merge join / shuffled hash join if applicable

2020-08-28 Thread GitBox


c21 commented on pull request #29473:
URL: https://github.com/apache/spark/pull/29473#issuecomment-683245934


   > did you observe any patterns or heuristics on your workloads where repartition is preferred?
   
   From our side, honestly, we don't have any automation for deciding between coalesce and repartition right now. We provide configs similar to the ones here so that users can control coalesce vs. repartition themselves.
   
   I think a rule of thumb is that we don't want to:
   (1) coalesce, if the coalesced table is too big and the number of coalesced buckets is too small: each task then has too much data and takes more time.
   (2) repartition, if the repartitioned table is too big and the number of repartitioned buckets is too large: too much duplicated data is read, with too much extra CPU/IO cost (possibly worse than just shuffling this table).
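
The trade-off in the rule of thumb above can be made concrete with a rough cost model. This is only a sketch under stated assumptions: buckets are uniformly sized, the larger bucket count is a multiple of the smaller one, and repartitioning means every new task re-reads a whole original bucket and keeps only its own rows. The function name `bucket_read_cost` is hypothetical and not part of Spark.

```python
def bucket_read_cost(table_bytes, num_buckets, new_num_buckets):
    """Return (bytes read per task, total bytes read) for a bucketed table
    that is read with new_num_buckets partitions instead of num_buckets.

    Assumption: uniform bucket sizes and divisible bucket counts.
    """
    bucket_bytes = table_bytes / num_buckets
    if new_num_buckets <= num_buckets:
        # Coalesce: each task reads several whole buckets, nothing is re-read.
        per_task = bucket_bytes * (num_buckets // new_num_buckets)
        total = table_bytes
    else:
        # Repartition: each original bucket is scanned once per new partition
        # that maps to it, so the whole table is read `ratio` times.
        per_task = bucket_bytes
        total = table_bytes * (new_num_buckets // num_buckets)
    return per_task, total


# Coalescing 8 buckets to 2 quadruples the data per task.
print(bucket_read_cost(1024, 8, 2))
# Repartitioning 2 buckets to 8 reads the table four times over.
print(bucket_read_cost(1024, 2, 8))
```

Under this model, coalescing hurts through the per-task figure and repartitioning hurts through the total, which matches cases (1) and (2) respectively.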



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] c21 commented on a change in pull request #29473: [SPARK-32656][SQL] Repartition bucketed tables for sort merge join / shuffled hash join if applicable

2020-08-28 Thread GitBox


c21 commented on a change in pull request #29473:
URL: https://github.com/apache/spark/pull/29473#discussion_r479612474



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/bucketing/CoalesceOrRepartitionBucketsInJoin.scala
##
@@ -27,45 +27,48 @@ import org.apache.spark.sql.catalyst.rules.Rule
 import org.apache.spark.sql.execution.{FileSourceScanExec, FilterExec, ProjectExec, SparkPlan}
 import org.apache.spark.sql.execution.joins.{BaseJoinExec, ShuffledHashJoinExec, SortMergeJoinExec}
 import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.internal.SQLConf.BucketReadStrategyInJoin
 
 /**
- * This rule coalesces one side of the `SortMergeJoin` and `ShuffledHashJoin`
+ * This rule coalesces or repartitions one side of the `SortMergeJoin` and `ShuffledHashJoin`
  * if the following conditions are met:
  *   - Two bucketed tables are joined.
  *   - Join keys match with output partition expressions on their respective sides.
  *   - The larger bucket number is divisible by the smaller bucket number.
- *   - COALESCE_BUCKETS_IN_JOIN_ENABLED is set to true.
  *   - The ratio of the number of buckets is less than the value set in
- *     COALESCE_BUCKETS_IN_JOIN_MAX_BUCKET_RATIO.
+ *     COALESCE_OR_REPARTITION_BUCKETS_IN_JOIN_MAX_BUCKET_RATIO.

Review comment:
   nit: shouldn't it be `BUCKET_READ_STRATEGY_IN_JOIN_MAX_BUCKET_RATIO` ?
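
The applicability conditions listed in the doc comment above can be sketched as a small decision function. This is an illustration only, not the rule's actual implementation: the function name `target_num_buckets` is hypothetical, the strategy strings mirror the `COALESCE`, `REPARTITION`, `OFF` values from the diff, and treating the ratio limit as inclusive is an assumption (the old config text says "less than or equal to", the doc comment says "less than").

```python
def target_num_buckets(left_buckets, right_buckets, strategy, max_ratio):
    """Return the bucket count both join sides would be read with,
    or None when the rule would not apply.

    Assumption: the join-key/partitioning check has already passed.
    """
    small = min(left_buckets, right_buckets)
    large = max(left_buckets, right_buckets)
    if strategy == "OFF" or small == large:
        return None  # nothing to do
    if large % small != 0:
        return None  # larger count must be divisible by the smaller one
    if large // small > max_ratio:
        return None  # ratio limit exceeded (inclusive bound is an assumption)
    if strategy == "COALESCE":
        return small  # read the larger side with fewer buckets
    if strategy == "REPARTITION":
        return large  # read the smaller side with more buckets
    raise ValueError("unknown strategy: " + strategy)


print(target_num_buckets(4, 8, "COALESCE", 4))
print(target_num_buckets(4, 8, "REPARTITION", 4))
```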

##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##
@@ -2655,24 +2655,34 @@ object SQLConf {
       .booleanConf
       .createWithDefault(true)
 
-  val COALESCE_BUCKETS_IN_JOIN_ENABLED =
-    buildConf("spark.sql.bucketing.coalesceBucketsInJoin.enabled")
-      .doc("When true, if two bucketed tables with the different number of buckets are joined, " +
-        "the side with a bigger number of buckets will be coalesced to have the same number " +
-        "of buckets as the other side. Bigger number of buckets is divisible by the smaller " +
-        "number of buckets. Bucket coalescing is applied to sort-merge joins and " +
-        "shuffled hash join. Note: Coalescing bucketed table can avoid unnecessary shuffling " +
-        "in join, but it also reduces parallelism and could possibly cause OOM for " +
-        "shuffled hash join.")
-      .version("3.1.0")
-      .booleanConf
-      .createWithDefault(false)
+  object BucketReadStrategyInJoin extends Enumeration {
+    val COALESCE, REPARTITION, OFF = Value
+  }
 
-  val COALESCE_BUCKETS_IN_JOIN_MAX_BUCKET_RATIO =
-    buildConf("spark.sql.bucketing.coalesceBucketsInJoin.maxBucketRatio")
-      .doc("The ratio of the number of two buckets being coalesced should be less than or " +
-        "equal to this value for bucket coalescing to be applied. This configuration only " +
-        s"has an effect when '${COALESCE_BUCKETS_IN_JOIN_ENABLED.key}' is set to true.")
+  val BUCKET_READ_STRATEGY_IN_JOIN =
+    buildConf("spark.sql.bucketing.bucketReadStrategyInJoin")
+      .doc("When set to COALESCE, if two bucketed tables with the different number of buckets " +

Review comment:
   nit: shall we first mention that the allowed values are "one of COALESCE, REPARTITION, OFF"? Users might not follow the long description here. It is also probably worth mentioning that the default is "OFF", where we neither coalesce nor repartition.
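
As a loose illustration of the suggestion above (state the allowed values up front and default to `OFF`), here is how such a three-valued setting might be validated. The class and function names are hypothetical, and this is a Python analogue rather than the Scala `SQLConf` code under review.

```python
from enum import Enum


class BucketReadStrategy(Enum):
    """Hypothetical mirror of the COALESCE / REPARTITION / OFF values."""
    COALESCE = "COALESCE"
    REPARTITION = "REPARTITION"
    OFF = "OFF"


def parse_strategy(value=None):
    """Parse a config string; an unset value falls back to OFF."""
    if value is None:
        return BucketReadStrategy.OFF
    try:
        return BucketReadStrategy[value.strip().upper()]
    except KeyError:
        allowed = ", ".join(s.name for s in BucketReadStrategy)
        # Name the allowed values first so the error is easy to act on.
        raise ValueError(
            "expected one of " + allowed + ", got " + repr(value)
        ) from None


print(parse_strategy())
print(parse_strategy("coalesce"))
```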

##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala
##
@@ -548,22 +560,42 @@ case class FileSourceScanExec(
       filesGroupedToBuckets
     }
 
-    val filePartitions = optionalNumCoalescedBuckets.map { numCoalescedBuckets =>
-      logInfo(s"Coalescing to ${numCoalescedBuckets} buckets")
-      val coalescedBuckets = prunedFilesGroupedToBuckets.groupBy(_._1 % numCoalescedBuckets)
-      Seq.tabulate(numCoalescedBuckets) { bucketId =>
-        val partitionedFiles = coalescedBuckets.get(bucketId).map {
-          _.values.flatten.toArray
-        }.getOrElse(Array.empty)
-        FilePartition(bucketId, partitionedFiles)
-      }
-    }.getOrElse {
-      Seq.tabulate(bucketSpec.numBuckets) { bucketId =>
+    if (optionalNewNumBuckets.isEmpty) {
+      val filePartitions = Seq.tabulate(bucketSpec.numBuckets) { bucketId =>
         FilePartition(bucketId, prunedFilesGroupedToBuckets.getOrElse(bucketId, Array.empty))
       }
+      new FileScanRDD(fsRelation.sparkSession, readFile, filePartitions)
+    } else {
+      val newNumBuckets = optionalNewNumBuckets.get
+      if (newNumBuckets < bucketSpec.numBuckets) {
+        assert(bucketSpec.numBuckets % newNumBuckets == 0)
+        logInfo(s"Coalescing to $newNumBuckets buckets from ${bucketSpec.numBuckets} buckets")
+        val coalescedBuckets = prunedFilesGroupedToBuckets.groupBy(_._1 % newNumBuckets)
+        val filePartitions = Seq.tabulate(newNumBuckets) { bucketId =>
+          val partitionedFiles = coalescedBuckets
+            .get(bucketId)
+ 
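
The coalescing branch in the hunk above groups bucket files by `bucketId % newNumBuckets`. A minimal Python sketch of that grouping follows; the function name `coalesce_buckets` and the dict-of-lists input are assumptions for illustration, not Spark's types. Grouping by the remainder is valid when the new count divides the old one, because `(h % numBuckets) % newNumBuckets == h % newNumBuckets` in that case.

```python
def coalesce_buckets(files_by_bucket, num_buckets, new_num_buckets):
    """Merge bucket file lists into new_num_buckets partitions.

    files_by_bucket maps an original bucket id to its list of files;
    buckets with no files may be absent from the mapping.
    """
    assert num_buckets % new_num_buckets == 0
    partitions = [[] for _ in range(new_num_buckets)]
    for bucket_id in sorted(files_by_bucket):
        # Original buckets that share a remainder land in the same partition.
        partitions[bucket_id % new_num_buckets].extend(files_by_bucket[bucket_id])
    return partitions


files = {0: ["f0"], 1: ["f1"], 2: ["f2"], 3: ["f3"]}
print(coalesce_buckets(files, 4, 2))
```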

[GitHub] [spark] HyukjinKwon commented on pull request #28968: [SPARK-32010][PYTHON][CORE] Add InheritableThread for local properties and fixing a thread leak issue in pinned thread mode

2020-08-28 Thread GitBox


HyukjinKwon commented on pull request #28968:
URL: https://github.com/apache/spark/pull/28968#issuecomment-683241944


   Thank you for taking a look @viirya.






[GitHub] [spark] HyukjinKwon edited a comment on pull request #28968: [SPARK-32010][PYTHON][CORE] Add InheritableThread for local properties and fixing a thread leak issue in pinned thread mode

2020-08-28 Thread GitBox


HyukjinKwon edited a comment on pull request #28968:
URL: https://github.com/apache/spark/pull/28968#issuecomment-683241944


   Thank you for taking a look @viirya.






[GitHub] [spark] HyukjinKwon commented on a change in pull request #28968: [SPARK-32010][PYTHON][CORE] Add InheritableThread for local properties and fixing a thread leak issue in pinned thread mode

2020-08-28 Thread GitBox


HyukjinKwon commented on a change in pull request #28968:
URL: https://github.com/apache/spark/pull/28968#discussion_r479612214



##
File path: python/pyspark/util.py
##
@@ -114,6 +117,64 @@ def _parse_memory(s):
         raise ValueError("invalid format: " + s)
     return int(float(s[:-1]) * units[s[-1].lower()])
 
+
+class InheritableThread(threading.Thread):
+    """
+    Thread that is recommended to be used in PySpark instead of :class:`threading.Thread`
+    when the pinned thread mode is enabled. The usage of this class is exactly same as
+    :class:`threading.Thread` but correctly inherits the inheritable properties specific
+    to JVM thread such as ``InheritableThreadLocal``.
+
+    Also, note that pinned thread mode does not close the connection from Python
+    to JVM when the thread is finished in the Python side. With this class, Python
+    garbage-collects the Python thread instance and also closes the connection
+    which finishes JVM thread correctly.
+
+    When the pinned thread mode is off, this works as :class:`threading.Thread`.
+
+    .. note:: Experimental
+
+    .. versionadded:: 3.1.0
+    """
+    def __init__(self, target, *args, **kwargs):
+        from pyspark import SparkContext
+
+        sc = SparkContext._active_spark_context
+
+        if isinstance(sc._gateway, ClientServer):
+            # Here's when the pinned-thread mode (PYSPARK_PIN_THREAD) is on.
+            properties = sc._jsc.sc().getLocalProperties().clone()
Review comment:
   Actually, we're mimicking that behaviour here: the thread in the JVM does not respect the inheritance because it is always created separately via the JVM gateway, whereas on the Scala/Java side we can keep the inheritance by creating a thread within a thread.
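
The idea of mimicking inheritance can be shown without a JVM. The sketch below is a pure-Python analogue, not the PySpark implementation: `get_props` and `InheritingThread` are hypothetical names, and a `threading.local` dict stands in for Spark's per-thread local properties. The parent's properties are copied when the thread object is constructed and installed in the child before the target runs.

```python
import copy
import threading

_local = threading.local()


def get_props():
    """Return the calling thread's own property dict (created on demand)."""
    if not hasattr(_local, "props"):
        _local.props = {}
    return _local.props


class InheritingThread(threading.Thread):
    def __init__(self, target, args=(), kwargs=None):
        # Runs in the parent thread: take a snapshot of its properties.
        inherited = copy.deepcopy(get_props())

        def wrapped(*a, **kw):
            # Runs in the child thread: install the snapshot, then do the work.
            _local.props = inherited
            return target(*a, **kw)

        super().__init__(target=wrapped, args=args, kwargs=kwargs)


get_props()["job.group"] = "etl"
seen = {}
t = InheritingThread(target=lambda: seen.update(get_props()))
t.start()
t.join()
print(seen)
```

Because the child receives a copy, changes it makes to its properties do not leak back into the parent, which is the same isolation a clone provides.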








[GitHub] [spark] HyukjinKwon commented on pull request #28968: [SPARK-32010][PYTHON][CORE] Add InheritableThread for local properties and fixing a thread leak issue in pinned thread mode

2020-08-28 Thread GitBox


HyukjinKwon commented on pull request #28968:
URL: https://github.com/apache/spark/pull/28968#issuecomment-683241584


   Oh yeah, we should use `InheritableThread` instead of `Thread` to verify this PR.






[GitHub] [spark] AmplabJenkins removed a comment on pull request #29473: [SPARK-32656][SQL] Repartition bucketed tables for sort merge join / shuffled hash join if applicable

2020-08-28 Thread GitBox


AmplabJenkins removed a comment on pull request #29473:
URL: https://github.com/apache/spark/pull/29473#issuecomment-683241374










[GitHub] [spark] AmplabJenkins commented on pull request #29473: [SPARK-32656][SQL] Repartition bucketed tables for sort merge join / shuffled hash join if applicable

2020-08-28 Thread GitBox


AmplabJenkins commented on pull request #29473:
URL: https://github.com/apache/spark/pull/29473#issuecomment-683241374










[GitHub] [spark] SparkQA removed a comment on pull request #29473: [SPARK-32656][SQL] Repartition bucketed tables for sort merge join / shuffled hash join if applicable

2020-08-28 Thread GitBox


SparkQA removed a comment on pull request #29473:
URL: https://github.com/apache/spark/pull/29473#issuecomment-683211632


   **[Test build #128011 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128011/testReport)** for PR 29473 at commit [`e2374ac`](https://github.com/apache/spark/commit/e2374ac281bbcb23c0dc49786ce7d8148f9761bd).






[GitHub] [spark] SparkQA commented on pull request #29473: [SPARK-32656][SQL] Repartition bucketed tables for sort merge join / shuffled hash join if applicable

2020-08-28 Thread GitBox


SparkQA commented on pull request #29473:
URL: https://github.com/apache/spark/pull/29473#issuecomment-683241185


   **[Test build #128011 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128011/testReport)** for PR 29473 at commit [`e2374ac`](https://github.com/apache/spark/commit/e2374ac281bbcb23c0dc49786ce7d8148f9761bd).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.






[GitHub] [spark] jiangxb1987 commented on a change in pull request #29228: [SPARK-31847][CORE][TESTS] DAGSchedulerSuite: Rewrite the test framework to support apply specified spark configurations.

2020-08-28 Thread GitBox


jiangxb1987 commented on a change in pull request #29228:
URL: https://github.com/apache/spark/pull/29228#discussion_r479608346



##
File path: core/src/test/scala/org/apache/spark/LocalSC.scala
##
@@ -0,0 +1,99 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark
+
+import _root_.io.netty.util.internal.logging.{InternalLoggerFactory, Slf4JLoggerFactory}
+import org.scalatest.BeforeAndAfterAll
+import org.scalatest.BeforeAndAfterEach
+import org.scalatest.Suite
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.resource.ResourceProfile
+
+/**
+ * Manages a local `sc` `SparkContext` variable, correctly stopping it after each test.
+ *
+ * Note: this class is a copy of [[LocalSparkContext]]. Why copy it? Reduce conflict. Because
+ * many test suites use [[LocalSparkContext]] and overwrite some variable or function (e.g.
+ * sc of LocalSparkContext), there occurs conflict when we refactor the `sc` as a new function.
+ * After migrating all test suites that use [[LocalSparkContext]] to use [[LocalSC]], we will
+ * delete the original [[LocalSparkContext]] and rename [[LocalSC]] to [[LocalSparkContext]].
+ */
+trait LocalSC extends BeforeAndAfterEach

Review comment:
   Since this class is only used temporarily, can we name it `TempLocalSparkContext`? TBH I don't like the `SC` name, which is very vague to me.








[GitHub] [spark] jiangxb1987 commented on pull request #29228: [SPARK-31847][CORE][TESTS] DAGSchedulerSuite: Rewrite the test framework to support apply specified spark configurations.

2020-08-28 Thread GitBox


jiangxb1987 commented on pull request #29228:
URL: https://github.com/apache/spark/pull/29228#issuecomment-683237385


   LGTM otherwise






[GitHub] [spark] AmplabJenkins commented on pull request #29575: [SPARK-32693][SQL][3.0] Compare two dataframes with same schema except nullable property

2020-08-28 Thread GitBox


AmplabJenkins commented on pull request #29575:
URL: https://github.com/apache/spark/pull/29575#issuecomment-683236144










[GitHub] [spark] SparkQA removed a comment on pull request #29575: [SPARK-32693][SQL][3.0] Compare two dataframes with same schema except nullable property

2020-08-28 Thread GitBox


SparkQA removed a comment on pull request #29575:
URL: https://github.com/apache/spark/pull/29575#issuecomment-683198820


   **[Test build #128008 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128008/testReport)** for PR 29575 at commit [`141c8f3`](https://github.com/apache/spark/commit/141c8f3eecea97ecc75ee02806566a9f29f41af8).






[GitHub] [spark] AmplabJenkins removed a comment on pull request #29575: [SPARK-32693][SQL][3.0] Compare two dataframes with same schema except nullable property

2020-08-28 Thread GitBox


AmplabJenkins removed a comment on pull request #29575:
URL: https://github.com/apache/spark/pull/29575#issuecomment-683236144










[GitHub] [spark] SparkQA commented on pull request #29575: [SPARK-32693][SQL][3.0] Compare two dataframes with same schema except nullable property

2020-08-28 Thread GitBox


SparkQA commented on pull request #29575:
URL: https://github.com/apache/spark/pull/29575#issuecomment-683235988


   **[Test build #128008 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128008/testReport)** for PR 29575 at commit [`141c8f3`](https://github.com/apache/spark/commit/141c8f3eecea97ecc75ee02806566a9f29f41af8).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.






[GitHub] [spark] viirya commented on pull request #28968: [SPARK-32010][PYTHON][CORE] Add InheritableThread for local properties and fixing a thread leak issue in pinned thread mode

2020-08-28 Thread GitBox


viirya commented on pull request #28968:
URL: https://github.com/apache/spark/pull/28968#issuecomment-683235995


   I found I had missed this and have looked at it now. LGTM. I'm just wondering whether we should use `InheritableThread` in the PR description to verify the fix?
   
   ```python
   >>> from threading import Thread
   >>> Thread(target=lambda: spark.range(1000).collect()).start()
   >>> Thread(target=lambda: spark.range(1000).collect()).start()
   >>> Thread(target=lambda: spark.range(1000).collect()).start()
   >>> spark._jvm._gateway_client.deque
   deque([, 
, 
, 
, 
])
   >>> Thread(target=lambda: spark.range(1000).collect()).start()
   >>> spark._jvm._gateway_client.deque
   deque([, 
, 
, 
, 
, 
])
   ```






[GitHub] [spark] viirya commented on a change in pull request #28968: [SPARK-32010][PYTHON][CORE] Add InheritableThread for local properties and fixing a thread leak issue in pinned thread mode

2020-08-28 Thread GitBox


viirya commented on a change in pull request #28968:
URL: https://github.com/apache/spark/pull/28968#discussion_r479605991



##
File path: python/pyspark/util.py
##
@@ -114,6 +117,64 @@ def _parse_memory(s):
         raise ValueError("invalid format: " + s)
     return int(float(s[:-1]) * units[s[-1].lower()])
 
+
+class InheritableThread(threading.Thread):
+    """
+    Thread that is recommended to be used in PySpark instead of :class:`threading.Thread`
+    when the pinned thread mode is enabled. The usage of this class is exactly same as
+    :class:`threading.Thread` but correctly inherits the inheritable properties specific
+    to JVM thread such as ``InheritableThreadLocal``.
+
+    Also, note that pinned thread mode does not close the connection from Python
+    to JVM when the thread is finished in the Python side. With this class, Python
+    garbage-collects the Python thread instance and also closes the connection
+    which finishes JVM thread correctly.
+
+    When the pinned thread mode is off, this works as :class:`threading.Thread`.
+
+    .. note:: Experimental
+
+    .. versionadded:: 3.1.0
+    """
+    def __init__(self, target, *args, **kwargs):
+        from pyspark import SparkContext
+
+        sc = SparkContext._active_spark_context
+
+        if isinstance(sc._gateway, ClientServer):
+            # Here's when the pinned-thread mode (PYSPARK_PIN_THREAD) is on.
+            properties = sc._jsc.sc().getLocalProperties().clone()
Review comment:
   Why do we need to `clone`? Doesn't `sc.localProperties` already get cloned in `childValue`?








[GitHub] [spark] viirya commented on a change in pull request #28968: [SPARK-32010][PYTHON][CORE] Add InheritableThread for local properties and fixing a thread leak issue in pinned thread mode

2020-08-28 Thread GitBox


viirya commented on a change in pull request #28968:
URL: https://github.com/apache/spark/pull/28968#discussion_r479604890



##
File path: docs/job-scheduling.md
##
@@ -297,11 +297,9 @@ via `sc.setJobGroup` in a separate PVM thread, which also disallows to cancel th
 later.
 
 In order to synchronize PVM threads with JVM threads, you should set `PYSPARK_PIN_THREAD` environment variable
-to `true`. This pinned thread mode allows one PVM thread has one corresponding JVM thread.
-
-However, currently it cannot inherit the local properties from the parent thread although it isolates
-each thread with its own local properties. To work around this, you should manually copy and set the
-local properties from the parent thread to the child thread when you create another thread in PVM.
+to `true`. This pinned thread mode allows one PVM thread has one corresponding JVM thread. With this mode,
+`pyspark.InheritableThread` is recommanded to use together for a PVM thread to inherit the interitable attributes

Review comment:
   typo: interitable -> inheritable








[GitHub] [spark] manuzhang commented on pull request #29540: [SPARK-32698][SQL] Do not fall back to default parallelism if the minimum number of coalesced partitions is not set in AQE

2020-08-28 Thread GitBox


manuzhang commented on pull request #29540:
URL: https://github.com/apache/spark/pull/29540#issuecomment-683232962


   @cloud-fan 
   What I mean is
   `spark.sql.adaptive.coalescePartitions.initialPartitionNum` > 
   `spark.default.parallelism` > 
   `spark.sql.shuffle.partitions`
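
Reading `>` in the ordering above as "takes precedence over" (an interpretation of the comment, not something the thread confirms), the proposed lookup could be sketched as a first-match fallback chain. The function name `initial_partition_num` and the plain-dict config are assumptions for illustration; this does not describe what Spark actually does.

```python
# Keys in the order proposed in the comment above, highest priority first.
PRECEDENCE = (
    "spark.sql.adaptive.coalescePartitions.initialPartitionNum",
    "spark.default.parallelism",
    "spark.sql.shuffle.partitions",
)


def initial_partition_num(conf):
    """Return the first explicitly set value in precedence order, else None."""
    for key in PRECEDENCE:
        if key in conf:
            return int(conf[key])
    return None


print(initial_partition_num({
    "spark.default.parallelism": "16",
    "spark.sql.shuffle.partitions": "200",
}))
```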
   






[GitHub] [spark] AmplabJenkins removed a comment on pull request #29576: [SPARK-32693][SQL][2.4] Compare two dataframes with same schema except nullable property

2020-08-28 Thread GitBox


AmplabJenkins removed a comment on pull request #29576:
URL: https://github.com/apache/spark/pull/29576#issuecomment-683230726










[GitHub] [spark] AmplabJenkins commented on pull request #29576: [SPARK-32693][SQL][2.4] Compare two dataframes with same schema except nullable property

2020-08-28 Thread GitBox


AmplabJenkins commented on pull request #29576:
URL: https://github.com/apache/spark/pull/29576#issuecomment-683230726










[GitHub] [spark] SparkQA commented on pull request #29576: [SPARK-32693][SQL][2.4] Compare two dataframes with same schema except nullable property

2020-08-28 Thread GitBox


SparkQA commented on pull request #29576:
URL: https://github.com/apache/spark/pull/29576#issuecomment-683230650


   **[Test build #128009 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128009/testReport)** for PR 29576 at commit [`d015266`](https://github.com/apache/spark/commit/d015266e1d275ff9cc2c1e75a11267079827bd3d).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.






[GitHub] [spark] SparkQA removed a comment on pull request #29576: [SPARK-32693][SQL][2.4] Compare two dataframes with same schema except nullable property

2020-08-28 Thread GitBox


SparkQA removed a comment on pull request #29576:
URL: https://github.com/apache/spark/pull/29576#issuecomment-683201784


   **[Test build #128009 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128009/testReport)** for PR 29576 at commit [`d015266`](https://github.com/apache/spark/commit/d015266e1d275ff9cc2c1e75a11267079827bd3d).






[GitHub] [spark] AmplabJenkins removed a comment on pull request #29353: [SPARK-32532][SQL] Improve ORC read/write performance on nested structs and array of structs

2020-08-28 Thread GitBox


AmplabJenkins removed a comment on pull request #29353:
URL: https://github.com/apache/spark/pull/29353#issuecomment-683229295










[GitHub] [spark] SparkQA removed a comment on pull request #29353: [SPARK-32532][SQL] Improve ORC read/write performance on nested structs and array of structs

2020-08-28 Thread GitBox


SparkQA removed a comment on pull request #29353:
URL: https://github.com/apache/spark/pull/29353#issuecomment-683181113


   **[Test build #128005 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128005/testReport)** for PR 29353 at commit [`d67ceed`](https://github.com/apache/spark/commit/d67ceed965fce5f56f2096032188c6fa9b3cfa5b).






[GitHub] [spark] AmplabJenkins commented on pull request #29353: [SPARK-32532][SQL] Improve ORC read/write performance on nested structs and array of structs

2020-08-28 Thread GitBox


AmplabJenkins commented on pull request #29353:
URL: https://github.com/apache/spark/pull/29353#issuecomment-683229295










[GitHub] [spark] SparkQA commented on pull request #29353: [SPARK-32532][SQL] Improve ORC read/write performance on nested structs and array of structs

2020-08-28 Thread GitBox


SparkQA commented on pull request #29353:
URL: https://github.com/apache/spark/pull/29353#issuecomment-683229139


   **[Test build #128005 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128005/testReport)** for PR 29353 at commit [`d67ceed`](https://github.com/apache/spark/commit/d67ceed965fce5f56f2096032188c6fa9b3cfa5b).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.






[GitHub] [spark] AmplabJenkins removed a comment on pull request #29574: Fix some R styles

2020-08-28 Thread GitBox


AmplabJenkins removed a comment on pull request #29574:
URL: https://github.com/apache/spark/pull/29574#issuecomment-683227764










[GitHub] [spark] AmplabJenkins commented on pull request #29574: Fix some R styles

2020-08-28 Thread GitBox


AmplabJenkins commented on pull request #29574:
URL: https://github.com/apache/spark/pull/29574#issuecomment-683227764










[GitHub] [spark] SparkQA removed a comment on pull request #29574: Fix some R styles

2020-08-28 Thread GitBox


SparkQA removed a comment on pull request #29574:
URL: https://github.com/apache/spark/pull/29574#issuecomment-683203194


   **[Test build #128010 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128010/testReport)** for PR 29574 at commit [`f1b46f3`](https://github.com/apache/spark/commit/f1b46f30d7c7d237c5c16a71ada3456c44089adc).






[GitHub] [spark] SparkQA commented on pull request #29574: Fix some R styles

2020-08-28 Thread GitBox


SparkQA commented on pull request #29574:
URL: https://github.com/apache/spark/pull/29574#issuecomment-683227542


   **[Test build #128010 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128010/testReport)** for PR 29574 at commit [`f1b46f3`](https://github.com/apache/spark/commit/f1b46f30d7c7d237c5c16a71ada3456c44089adc).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.






[GitHub] [spark] AmplabJenkins removed a comment on pull request #29473: [SPARK-32656][SQL] Repartition bucketed tables for sort merge join / shuffled hash join if applicable

2020-08-28 Thread GitBox


AmplabJenkins removed a comment on pull request #29473:
URL: https://github.com/apache/spark/pull/29473#issuecomment-683227452










[GitHub] [spark] AmplabJenkins commented on pull request #29473: [SPARK-32656][SQL] Repartition bucketed tables for sort merge join / shuffled hash join if applicable

2020-08-28 Thread GitBox


AmplabJenkins commented on pull request #29473:
URL: https://github.com/apache/spark/pull/29473#issuecomment-683227452










[GitHub] [spark] SparkQA commented on pull request #29473: [SPARK-32656][SQL] Repartition bucketed tables for sort merge join / shuffled hash join if applicable

2020-08-28 Thread GitBox


SparkQA commented on pull request #29473:
URL: https://github.com/apache/spark/pull/29473#issuecomment-683227251


   **[Test build #128012 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128012/testReport)** for PR 29473 at commit [`7481e36`](https://github.com/apache/spark/commit/7481e36d8781e869a0dc558e0af5d358a56ab150).






[GitHub] [spark] AmplabJenkins removed a comment on pull request #29574: Fix some R styles

2020-08-28 Thread GitBox


AmplabJenkins removed a comment on pull request #29574:
URL: https://github.com/apache/spark/pull/29574#issuecomment-683225470










[GitHub] [spark] AmplabJenkins commented on pull request #29574: Fix some R styles

2020-08-28 Thread GitBox


AmplabJenkins commented on pull request #29574:
URL: https://github.com/apache/spark/pull/29574#issuecomment-683225470










[GitHub] [spark] SparkQA removed a comment on pull request #29574: Fix some R styles

2020-08-28 Thread GitBox


SparkQA removed a comment on pull request #29574:
URL: https://github.com/apache/spark/pull/29574#issuecomment-683197133


   **[Test build #128007 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128007/testReport)** for PR 29574 at commit [`85fa83a`](https://github.com/apache/spark/commit/85fa83ad978fd0ae2b757fd2d72272dc54e3089b).






[GitHub] [spark] SparkQA commented on pull request #29574: Fix some R styles

2020-08-28 Thread GitBox


SparkQA commented on pull request #29574:
URL: https://github.com/apache/spark/pull/29574#issuecomment-683225203


   **[Test build #128007 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128007/testReport)** for PR 29574 at commit [`85fa83a`](https://github.com/apache/spark/commit/85fa83ad978fd0ae2b757fd2d72272dc54e3089b).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.






[GitHub] [spark] AmplabJenkins removed a comment on pull request #29352: [SPARK-32531][SQL][TEST] Add benchmarks for nested structs and arrays for different file formats

2020-08-28 Thread GitBox


AmplabJenkins removed a comment on pull request #29352:
URL: https://github.com/apache/spark/pull/29352#issuecomment-683222745










[GitHub] [spark] AmplabJenkins commented on pull request #29352: [SPARK-32531][SQL][TEST] Add benchmarks for nested structs and arrays for different file formats

2020-08-28 Thread GitBox


AmplabJenkins commented on pull request #29352:
URL: https://github.com/apache/spark/pull/29352#issuecomment-683222745










[GitHub] [spark] SparkQA removed a comment on pull request #29352: [SPARK-32531][SQL][TEST] Add benchmarks for nested structs and arrays for different file formats

2020-08-28 Thread GitBox


SparkQA removed a comment on pull request #29352:
URL: https://github.com/apache/spark/pull/29352#issuecomment-683160455


   **[Test build #128004 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128004/testReport)** for PR 29352 at commit [`bef3d35`](https://github.com/apache/spark/commit/bef3d357ecdbe3be4a468184b4917b540ff7625e).






[GitHub] [spark] SparkQA commented on pull request #29352: [SPARK-32531][SQL][TEST] Add benchmarks for nested structs and arrays for different file formats

2020-08-28 Thread GitBox


SparkQA commented on pull request #29352:
URL: https://github.com/apache/spark/pull/29352#issuecomment-683222409


   **[Test build #128004 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128004/testReport)** for PR 29352 at commit [`bef3d35`](https://github.com/apache/spark/commit/bef3d357ecdbe3be4a468184b4917b540ff7625e).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.






[GitHub] [spark] sunchao commented on pull request #29567: [SPARK-32721][SQL] Simplify if clauses with null and boolean

2020-08-28 Thread GitBox


sunchao commented on pull request #29567:
URL: https://github.com/apache/spark/pull/29567#issuecomment-683219924


   cc @dbtsai @viirya 






[GitHub] [spark] AmplabJenkins removed a comment on pull request #29473: [SPARK-32656][SQL] Repartition bucketed tables for sort merge join / shuffled hash join if applicable

2020-08-28 Thread GitBox


AmplabJenkins removed a comment on pull request #29473:
URL: https://github.com/apache/spark/pull/29473#issuecomment-683219778










[GitHub] [spark] AmplabJenkins commented on pull request #29473: [SPARK-32656][SQL] Repartition bucketed tables for sort merge join / shuffled hash join if applicable

2020-08-28 Thread GitBox


AmplabJenkins commented on pull request #29473:
URL: https://github.com/apache/spark/pull/29473#issuecomment-683219778










[GitHub] [spark] sunchao commented on a change in pull request #29542: [SPARK-32703][SQL] Replace deprecated API calls from SpecificParquetRecordReaderBase

2020-08-28 Thread GitBox


sunchao commented on a change in pull request #29542:
URL: https://github.com/apache/spark/pull/29542#discussion_r479594051



##
File path: 
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java
##
@@ -92,60 +88,24 @@
   public void initialize(InputSplit inputSplit, TaskAttemptContext taskAttemptContext)
   throws IOException, InterruptedException {
 Configuration configuration = taskAttemptContext.getConfiguration();
-ParquetInputSplit split = (ParquetInputSplit)inputSplit;
+FileSplit split = (FileSplit) inputSplit;
 this.file = split.getPath();
-long[] rowGroupOffsets = split.getRowGroupOffsets();
 
-ParquetMetadata footer;
-List<BlockMetaData> blocks;
-
-// if task.side.metadata is set, rowGroupOffsets is null
-if (rowGroupOffsets == null) {
-  // then we need to apply the predicate push down filter
-  footer = readFooter(configuration, file, range(split.getStart(), split.getEnd()));
-  MessageType fileSchema = footer.getFileMetaData().getSchema();
-  FilterCompat.Filter filter = getFilter(configuration);
-  blocks = filterRowGroups(filter, footer.getBlocks(), fileSchema);
-} else {
-  // otherwise we find the row groups that were selected on the client
-  footer = readFooter(configuration, file, NO_FILTER);

Review comment:
   I think this path is never triggered. As you can see below, we always construct `ParquetInputSplit` with `rowGroupOffsets` set to null. `rowGroupOffsets` is also deprecated, along with the `ParquetInputSplit` class itself.








[GitHub] [spark] SparkQA removed a comment on pull request #29473: [SPARK-32656][SQL] Repartition bucketed tables for sort merge join / shuffled hash join if applicable

2020-08-28 Thread GitBox


SparkQA removed a comment on pull request #29473:
URL: https://github.com/apache/spark/pull/29473#issuecomment-683158107


   **[Test build #128003 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128003/testReport)** for PR 29473 at commit [`5665bc1`](https://github.com/apache/spark/commit/5665bc1107d6f9f06d1663a6f1ba8fa2ef5491e5).






[GitHub] [spark] SparkQA commented on pull request #29473: [SPARK-32656][SQL] Repartition bucketed tables for sort merge join / shuffled hash join if applicable

2020-08-28 Thread GitBox


SparkQA commented on pull request #29473:
URL: https://github.com/apache/spark/pull/29473#issuecomment-683219393


   **[Test build #128003 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128003/testReport)** for PR 29473 at commit [`5665bc1`](https://github.com/apache/spark/commit/5665bc1107d6f9f06d1663a6f1ba8fa2ef5491e5).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.






[GitHub] [spark] sunchao commented on a change in pull request #29565: [SPARK-24994][SQL] Simplify casts for literal types

2020-08-28 Thread GitBox


sunchao commented on a change in pull request #29565:
URL: https://github.com/apache/spark/pull/29565#discussion_r479593866



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/UnwrapCast.scala
##
@@ -0,0 +1,214 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.expressions.Literal.FalseLiteral
+import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.types._
+
+/**
+ * Unwrap casts in binary comparison operations with patterns like following:
+ *
+ * `BinaryComparison(Cast(fromExp, toType), Literal(value, toType))`
+ *   or
+ * `BinaryComparison(Literal(value, toType), Cast(fromExp, toType))`
+ *
+ * This rule optimizes expressions with the above pattern by either replacing the cast with simpler
+ * constructs, or moving the cast from the expression side to the literal side, which enables them
+ * to be optimized away later and pushed down to data sources.
+ *
+ * Currently this only handles cases where `fromType` (of `fromExp`) and `toType` are of integral
+ * types (i.e., byte, short, int and long). The rule checks to see if the literal `value` is
+ * within range `(min, max)`, where `min` and `max` are the minimum and maximum value of
+ * `fromType`, respectively. If this is true then it means we can safely cast `value` to `fromType`
+ * and thus able to move the cast to the literal side.
+ *
+ * If the `value` is not within range `(min, max)`, the rule breaks the scenario into different
+ * cases and try to replace each with simpler constructs.
+ *
+ * if `value > max`, the cases are of following:
+ *  - `cast(exp, ty) > value` ==> if(isnull(exp), null, false)
+ *  - `cast(exp, ty) >= value` ==> if(isnull(exp), null, false)
+ *  - `cast(exp, ty) === value` ==> if(isnull(exp), null, false)
+ *  - `cast(exp, ty) <=> value` ==> false
+ *  - `cast(exp, ty) <= value` ==> if(isnull(exp), null, true)
+ *  - `cast(exp, ty) < value` ==> if(isnull(exp), null, true)
+ *
+ * if `value == max`, the cases are of following:
+ *  - `cast(exp, ty) > value` ==> if(isnull(exp), null, false)
+ *  - `cast(exp, ty) >= value` ==> exp == max
+ *  - `cast(exp, ty) === value` ==> exp == max
+ *  - `cast(exp, ty) <=> value` ==> exp == max
+ *  - `cast(exp, ty) <= value` ==> if(isnull(exp), null, true)
+ *  - `cast(exp, ty) < value` ==> exp =!= max
+ *
+ * Similarly for the cases when `value == min` and `value < min`.
+ *
+ * Further, the above `if(isnull(exp), null, false)` is represented using conjunction
+ * `and(isnull(exp), null)`, to enable further optimization and filter pushdown to data sources.
+ * Similarly, `if(isnull(exp), null, true)` is represented with `or(isnotnull(exp), null)`.
+ */
+object UnwrapCast extends Rule[LogicalPlan] {
+  override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+case l: LogicalPlan => l transformExpressionsUp {
+  case e @ BinaryComparison(_, _) => unwrapCast(e)
+}
+  }
+
+  private def unwrapCast(exp: Expression): Expression = exp match {
+case BinaryComparison(Literal(_, _), Cast(_, _, _)) =>
+  // Not a canonical form. In this case we first canonicalize the expression by swapping the
+  // literal and cast side, then process the result and swap the literal and cast again to
+  // restore the original order.
+  def swap(e: Expression): Expression = e match {
+case GreaterThan(left, right) => LessThan(right, left)
+case GreaterThanOrEqual(left, right) => LessThanOrEqual(right, left)
+case EqualTo(left, right) => EqualTo(right, left)
+case EqualNullSafe(left, right) => EqualNullSafe(right, left)
+case LessThanOrEqual(left, right) => GreaterThanOrEqual(right, left)
+case LessThan(left, right) => GreaterThan(right, left)
+case _ => e
+  }
+
+  swap(unwrapCast(swap(exp)))
+
+case BinaryComparison(Cast(fromExp, _, _), Literal(value, toType))
+  if canImplicitlyCast(fromExp, toType) =>
+
+  // In case both sides have integral type, o
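
The case analysis in the Scaladoc quoted above can be illustrated with a small standalone sketch. This is not the code from the pull request: the class `UnwrapCastSketch`, the method `rewrite`, and the string-based output are hypothetical and exist only to make the table concrete. It covers the in-range case plus the `value > max` and `value == max` tables exactly as listed; the `value == min` and `value < min` cases are only described as "similar" in the quoted text, so the sketch leaves those expressions unchanged.

```java
public class UnwrapCastSketch {
    // Replacement for a comparison that is false for every non-null input.
    static final String FALSE_IF_NOT_NULL = "if(isnull(exp), null, false)";
    // Replacement for a comparison that is true for every non-null input.
    static final String TRUE_IF_NOT_NULL = "if(isnull(exp), null, true)";

    /**
     * Describes how `cast(exp, ty) op value` would be rewritten, where min and
     * max are the bounds of exp's original integral type (hypothetical helper).
     */
    static String rewrite(String op, long value, long min, long max) {
        if (value > min && value < max) {
            // The literal fits in fromType, so the cast moves to the literal side.
            return "exp " + op + " cast(" + value + " as fromType)";
        }
        if (value > max) {
            switch (op) {
                case ">": case ">=": case "===": return FALSE_IF_NOT_NULL;
                case "<=>": return "false";
                case "<=": case "<": return TRUE_IF_NOT_NULL;
                default: break;
            }
        } else if (value == max) {
            switch (op) {
                case ">": return FALSE_IF_NOT_NULL;
                case ">=": case "===": case "<=>": return "exp == max";
                case "<=": return TRUE_IF_NOT_NULL;
                case "<": return "exp =!= max";
                default: break;
            }
        }
        // value <= min or an unknown operator: not covered by this sketch,
        // so the original comparison is returned unchanged.
        return "cast(exp, ty) " + op + " " + value;
    }

    public static void main(String[] args) {
        long min = Short.MIN_VALUE;
        long max = Short.MAX_VALUE;
        System.out.println(rewrite("<", 100, min, max));     // literal in range
        System.out.println(rewrite(">", 40000, min, max));   // literal above max
        System.out.println(rewrite("<=>", 40000, min, max)); // null-safe equality
        System.out.println(rewrite("<", max, min, max));     // literal equals max
    }
}
```

With a `short` column, for example, a comparison against 40000 can never be true for a non-null value, which is why the rule can drop the cast entirely in that case.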

[GitHub] [spark] AmplabJenkins removed a comment on pull request #29567: [WIP][SPARK-32721][SQL] Simplify if clauses with null and boolean

2020-08-28 Thread GitBox


AmplabJenkins removed a comment on pull request #29567:
URL: https://github.com/apache/spark/pull/29567#issuecomment-683214869










[GitHub] [spark] AmplabJenkins commented on pull request #29567: [WIP][SPARK-32721][SQL] Simplify if clauses with null and boolean

2020-08-28 Thread GitBox


AmplabJenkins commented on pull request #29567:
URL: https://github.com/apache/spark/pull/29567#issuecomment-683214869










[GitHub] [spark] SparkQA removed a comment on pull request #29567: [WIP][SPARK-32721][SQL] Simplify if clauses with null and boolean

2020-08-28 Thread GitBox


SparkQA removed a comment on pull request #29567:
URL: https://github.com/apache/spark/pull/29567#issuecomment-683147685


   **[Test build #128002 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128002/testReport)** for PR 29567 at commit [`cc66198`](https://github.com/apache/spark/commit/cc661984f3ccbf59bbabb91c7d92b17524ab74d3).






[GitHub] [spark] SparkQA commented on pull request #29567: [WIP][SPARK-32721][SQL] Simplify if clauses with null and boolean

2020-08-28 Thread GitBox


SparkQA commented on pull request #29567:
URL: https://github.com/apache/spark/pull/29567#issuecomment-683214525


   **[Test build #128002 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128002/testReport)** for PR 29567 at commit [`cc66198`](https://github.com/apache/spark/commit/cc661984f3ccbf59bbabb91c7d92b17524ab74d3).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.






[GitHub] [spark] viirya commented on a change in pull request #29542: [SPARK-32703][SQL] Replace deprecated API calls from SpecificParquetRecordReaderBase

2020-08-28 Thread GitBox


viirya commented on a change in pull request #29542:
URL: https://github.com/apache/spark/pull/29542#discussion_r479591040



##
File path: 
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java
##
@@ -92,60 +88,24 @@
   public void initialize(InputSplit inputSplit, TaskAttemptContext taskAttemptContext)
   throws IOException, InterruptedException {
 Configuration configuration = taskAttemptContext.getConfiguration();
-ParquetInputSplit split = (ParquetInputSplit)inputSplit;
+FileSplit split = (FileSplit) inputSplit;
 this.file = split.getPath();
-long[] rowGroupOffsets = split.getRowGroupOffsets();
 
-ParquetMetadata footer;
-List<BlockMetaData> blocks;
-
-// if task.side.metadata is set, rowGroupOffsets is null
-if (rowGroupOffsets == null) {
-  // then we need to apply the predicate push down filter
-  footer = readFooter(configuration, file, range(split.getStart(), split.getEnd()));
-  MessageType fileSchema = footer.getFileMetaData().getSchema();
-  FilterCompat.Filter filter = getFilter(configuration);
-  blocks = filterRowGroups(filter, footer.getBlocks(), fileSchema);
-} else {
-  // otherwise we find the row groups that were selected on the client
-  footer = readFooter(configuration, file, NO_FILTER);

Review comment:
   So we don't need row group selection here either?








[GitHub] [spark] viirya commented on a change in pull request #29542: [SPARK-32703][SQL] Replace deprecated API calls from SpecificParquetRecordReaderBase

2020-08-28 Thread GitBox


viirya commented on a change in pull request #29542:
URL: https://github.com/apache/spark/pull/29542#discussion_r479590396



##
File path: 
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java
##
@@ -92,60 +88,24 @@
   public void initialize(InputSplit inputSplit, TaskAttemptContext taskAttemptContext)
   throws IOException, InterruptedException {
 Configuration configuration = taskAttemptContext.getConfiguration();
-ParquetInputSplit split = (ParquetInputSplit)inputSplit;
+FileSplit split = (FileSplit) inputSplit;
 this.file = split.getPath();
-long[] rowGroupOffsets = split.getRowGroupOffsets();
 
-ParquetMetadata footer;
-List blocks;
-
-// if task.side.metadata is set, rowGroupOffsets is null
-if (rowGroupOffsets == null) {
-  // then we need to apply the predicate push down filter
-  footer = readFooter(configuration, file, range(split.getStart(), split.getEnd()));
-  MessageType fileSchema = footer.getFileMetaData().getSchema();
-  FilterCompat.Filter filter = getFilter(configuration);

Review comment:
   OK, I see. Tracing into the Parquet source code, `HadoopReadOptions` reads 
the filter via `getFilter`, and `ParquetFileReader` then uses it.








[GitHub] [spark] sunchao commented on a change in pull request #29565: [SPARK-24994][SQL] Simplify casts for literal types

2020-08-28 Thread GitBox


sunchao commented on a change in pull request #29565:
URL: https://github.com/apache/spark/pull/29565#discussion_r479590420



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/UnwrapCast.scala
##
@@ -0,0 +1,214 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.expressions.Literal.FalseLiteral
+import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.types._
+
+/**
+ * Unwrap casts in binary comparison operations with patterns like following:
+ *
+ * `BinaryComparison(Cast(fromExp, toType), Literal(value, toType))`
+ *   or
+ * `BinaryComparison(Literal(value, toType), Cast(fromExp, toType))`
+ *
+ * This rule optimizes expressions with the above pattern by either replacing the cast with simpler
+ * constructs, or moving the cast from the expression side to the literal side, which enables them
+ * to be optimized away later and pushed down to data sources.
+ *
+ * Currently this only handles cases where `fromType` (of `fromExp`) and `toType` are of integral
+ * types (i.e., byte, short, int and long). The rule checks to see if the literal `value` is
+ * within range `(min, max)`, where `min` and `max` are the minimum and maximum value of
+ * `fromType`, respectively. If this is true then it means we can safely cast `value` to `fromType`
+ * and thus able to move the cast to the literal side.
+ *
+ * If the `value` is not within range `(min, max)`, the rule breaks the scenario into different
+ * cases and try to replace each with simpler constructs.
+ *
+ * if `value > max`, the cases are of following:
+ *  - `cast(exp, ty) > value` ==> if(isnull(exp), null, false)
+ *  - `cast(exp, ty) >= value` ==> if(isnull(exp), null, false)
+ *  - `cast(exp, ty) === value` ==> if(isnull(exp), null, false)
+ *  - `cast(exp, ty) <=> value` ==> false
+ *  - `cast(exp, ty) <= value` ==> if(isnull(exp), null, true)
+ *  - `cast(exp, ty) < value` ==> if(isnull(exp), null, true)
+ *
+ * if `value == max`, the cases are of following:
+ *  - `cast(exp, ty) > value` ==> if(isnull(exp), null, false)
+ *  - `cast(exp, ty) >= value` ==> exp == max
+ *  - `cast(exp, ty) === value` ==> exp == max
+ *  - `cast(exp, ty) <=> value` ==> exp == max
+ *  - `cast(exp, ty) <= value` ==> if(isnull(exp), null, true)
+ *  - `cast(exp, ty) < value` ==> exp =!= max
+ *
+ * Similarly for the cases when `value == min` and `value < min`.
+ *
+ * Further, the above `if(isnull(exp), null, false)` is represented using conjunction
+ * `and(isnull(exp), null)`, to enable further optimization and filter pushdown to data sources.
+ * Similarly, `if(isnull(exp), null, true)` is represented with `or(isnotnull(exp), null)`.
+ */
+object UnwrapCast extends Rule[LogicalPlan] {
+  override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+case l: LogicalPlan => l transformExpressionsUp {
+  case e @ BinaryComparison(_, _) => unwrapCast(e)
+}
+  }
+
+  private def unwrapCast(exp: Expression): Expression = exp match {
+case BinaryComparison(Literal(_, _), Cast(_, _, _)) =>
+  // Not a canonical form. In this case we first canonicalize the expression by swapping the
+  // literal and cast side, then process the result and swap the literal and cast again to
+  // restore the original order.
+  def swap(e: Expression): Expression = e match {
+case GreaterThan(left, right) => LessThan(right, left)
+case GreaterThanOrEqual(left, right) => LessThanOrEqual(right, left)
+case EqualTo(left, right) => EqualTo(right, left)
+case EqualNullSafe(left, right) => EqualNullSafe(right, left)
+case LessThanOrEqual(left, right) => GreaterThanOrEqual(right, left)
+case LessThan(left, right) => GreaterThan(right, left)
+case _ => e
+  }
+
+  swap(unwrapCast(swap(exp)))
+
+case BinaryComparison(Cast(fromExp, _, _), Literal(value, toType))
+  if canImplicitlyCast(fromExp, toType) =>
+
+  // In case both sides have integral type, o
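
The quoted diff is cut off at this point. As a rough illustration of the range check described in the Scaladoc above, here is a minimal sketch in plain Java rather than Catalyst Scala. It assumes a `short` column compared against an `int` literal with `>`, and models SQL NULL as a Java `null`. The class and method names are hypothetical and do not come from the PR.

```java
// Toy model of the rewrite for `cast(exp as int) > value`, where exp is a short column.
// Not Spark code: the names here are illustrative assumptions only.
final class UnwrapCastSketch {
    private UnwrapCastSketch() {}

    /** Baseline: widen exp to int, then compare. A null result models SQL NULL. */
    static Boolean castThenGreaterThan(Short exp, int value) {
        if (exp == null) {
            return null;
        }
        return exp.intValue() > value;
    }

    /** Rewritten form: exp is never widened; the literal's position in the short range decides. */
    static Boolean unwrappedGreaterThan(Short exp, int value) {
        if (exp == null) {
            return null; // every rewritten form of `>` yields NULL for a null input
        }
        if (value >= Short.MAX_VALUE) {
            // value > max or value == max: no short exceeds it,
            // matching if(isnull(exp), null, false) in the Scaladoc
            return false;
        }
        if (value < Short.MIN_VALUE) {
            // value < min: every short exceeds it, matching if(isnull(exp), null, true)
            return true;
        }
        // value fits in a short: move the cast to the literal side and compare as shorts
        short narrowed = (short) value;
        return exp.shortValue() > narrowed;
    }
}
```

For instance, `unwrappedGreaterThan((short) 5, 100000)` returns `false` without widening the column value, and the two methods agree for every combination of input and literal, including the boundary literals `Short.MIN_VALUE` and `Short.MAX_VALUE`.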

[GitHub] [spark] imback82 commented on a change in pull request #29473: [SPARK-32656][SQL] Repartition bucketed tables for sort merge join / shuffled hash join if applicable

2020-08-28 Thread GitBox


imback82 commented on a change in pull request #29473:
URL: https://github.com/apache/spark/pull/29473#discussion_r479590003



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/bucketing/CoalesceOrRepartitionBucketsInJoin.scala
##
@@ -83,23 +82,39 @@ case class CoalesceBucketsInJoin(conf: SQLConf) extends Rule[SparkPlan] {
   }
 
   def apply(plan: SparkPlan): SparkPlan = {
-if (!conf.coalesceBucketsInJoinEnabled) {
+if (!conf.coalesceBucketsInJoinEnabled && !conf.repartitionBucketsInJoinEnabled) {
   return plan
 }
 
+if (conf.coalesceBucketsInJoinEnabled && conf.repartitionBucketsInJoinEnabled) {
+  throw new AnalysisException("Both 'spark.sql.bucketing.coalesceBucketsInJoin.enabled' and " +
+"'spark.sql.bucketing.repartitionBucketsInJoin.enabled' cannot be set to true at the" +
+"same time")

Review comment:
   Thanks for the suggestion! I think the new config makes more sense. I 
renamed a few things; let me know if the new names don't make sense.
   
   Btw, do you think I can introduce `AUTOMATIC` as a follow-up, since this PR 
is already sizable? Let me know if you want to see it in this PR. Thanks.








[GitHub] [spark] AmplabJenkins removed a comment on pull request #29473: [SPARK-32656][SQL] Repartition bucketed tables for sort merge join / shuffled hash join if applicable

2020-08-28 Thread GitBox


AmplabJenkins removed a comment on pull request #29473:
URL: https://github.com/apache/spark/pull/29473#issuecomment-683211853










[GitHub] [spark] AmplabJenkins commented on pull request #29473: [SPARK-32656][SQL] Repartition bucketed tables for sort merge join / shuffled hash join if applicable

2020-08-28 Thread GitBox


AmplabJenkins commented on pull request #29473:
URL: https://github.com/apache/spark/pull/29473#issuecomment-683211853










[GitHub] [spark] SparkQA commented on pull request #29473: [SPARK-32656][SQL] Repartition bucketed tables for sort merge join / shuffled hash join if applicable

2020-08-28 Thread GitBox


SparkQA commented on pull request #29473:
URL: https://github.com/apache/spark/pull/29473#issuecomment-683211632


   **[Test build #128011 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128011/testReport)** for PR 29473 at commit [`e2374ac`](https://github.com/apache/spark/commit/e2374ac281bbcb23c0dc49786ce7d8148f9761bd).






[GitHub] [spark] AmplabJenkins commented on pull request #29565: [SPARK-24994][SQL] Simplify casts for literal types

2020-08-28 Thread GitBox


AmplabJenkins commented on pull request #29565:
URL: https://github.com/apache/spark/pull/29565#issuecomment-683211038










[GitHub] [spark] AmplabJenkins removed a comment on pull request #29565: [SPARK-24994][SQL] Simplify casts for literal types

2020-08-28 Thread GitBox


AmplabJenkins removed a comment on pull request #29565:
URL: https://github.com/apache/spark/pull/29565#issuecomment-683211038










[GitHub] [spark] sunchao commented on pull request #29542: [SPARK-32703][SQL] Replace deprecated API calls from SpecificParquetRecordReaderBase

2020-08-28 Thread GitBox


sunchao commented on pull request #29542:
URL: https://github.com/apache/spark/pull/29542#issuecomment-683210745


   > The Parquet reader is a performance-critical component in Spark SQL. We 
had better make sure this change causes no performance regression. Should we 
run a benchmark to check it?
   
   Sure, let me do that. Should I just run `DataSourceReadBenchmark`?






[GitHub] [spark] SparkQA removed a comment on pull request #29565: [SPARK-24994][SQL] Simplify casts for literal types

2020-08-28 Thread GitBox


SparkQA removed a comment on pull request #29565:
URL: https://github.com/apache/spark/pull/29565#issuecomment-683142261


   **[Test build #128001 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128001/testReport)** for PR 29565 at commit [`fc4311b`](https://github.com/apache/spark/commit/fc4311b8f9ef52de1e0ea623ba239a66b03dbd50).






[GitHub] [spark] SparkQA commented on pull request #29565: [SPARK-24994][SQL] Simplify casts for literal types

2020-08-28 Thread GitBox


SparkQA commented on pull request #29565:
URL: https://github.com/apache/spark/pull/29565#issuecomment-683210663


   **[Test build #128001 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128001/testReport)** for PR 29565 at commit [`fc4311b`](https://github.com/apache/spark/commit/fc4311b8f9ef52de1e0ea623ba239a66b03dbd50).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.






[GitHub] [spark] sunchao commented on a change in pull request #29542: [SPARK-32703][SQL] Replace deprecated API calls from SpecificParquetRecordReaderBase

2020-08-28 Thread GitBox


sunchao commented on a change in pull request #29542:
URL: https://github.com/apache/spark/pull/29542#discussion_r479588691



##
File path: 
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java
##
@@ -22,22 +22,20 @@
 import java.io.IOException;
 import java.lang.reflect.InvocationTargetException;
 import java.util.ArrayList;
-import java.util.Arrays;
 import java.util.Collections;
 import java.util.HashMap;
 import java.util.HashSet;
 import java.util.List;
 import java.util.Map;
 import java.util.Set;
 
+import org.apache.hadoop.mapred.FileSplit;
+import org.apache.parquet.HadoopReadOptions;
+import org.apache.parquet.ParquetReadOptions;
+import org.apache.parquet.hadoop.util.HadoopInputFile;
+import org.apache.spark.sql.internal.SQLConf;

Review comment:
   Sure. It's annoying that my IntelliJ did this. Do you usually run 
`dev/scalafmt` before submitting a PR?








[GitHub] [spark] sunchao commented on a change in pull request #29542: [SPARK-32703][SQL] Replace deprecated API calls from SpecificParquetRecordReaderBase

2020-08-28 Thread GitBox


sunchao commented on a change in pull request #29542:
URL: https://github.com/apache/spark/pull/29542#discussion_r479587433



##
File path: 
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java
##
@@ -92,60 +88,24 @@
   public void initialize(InputSplit inputSplit, TaskAttemptContext taskAttemptContext)
   throws IOException, InterruptedException {
 Configuration configuration = taskAttemptContext.getConfiguration();
-ParquetInputSplit split = (ParquetInputSplit)inputSplit;
+FileSplit split = (FileSplit) inputSplit;
 this.file = split.getPath();
-long[] rowGroupOffsets = split.getRowGroupOffsets();
 
-ParquetMetadata footer;
-List blocks;
-
-// if task.side.metadata is set, rowGroupOffsets is null
-if (rowGroupOffsets == null) {
-  // then we need to apply the predicate push down filter
-  footer = readFooter(configuration, file, range(split.getStart(), split.getEnd()));
-  MessageType fileSchema = footer.getFileMetaData().getSchema();
-  FilterCompat.Filter filter = getFilter(configuration);

Review comment:
   No. Currently we apply filter pushdown twice: once here and once in the 
`ParquetFileReader` constructor. This change removes the first one.








[GitHub] [spark] AmplabJenkins removed a comment on pull request #29574: Fix some R styles

2020-08-28 Thread GitBox


AmplabJenkins removed a comment on pull request #29574:
URL: https://github.com/apache/spark/pull/29574#issuecomment-683203442










[GitHub] [spark] AmplabJenkins commented on pull request #29574: Fix some R styles

2020-08-28 Thread GitBox


AmplabJenkins commented on pull request #29574:
URL: https://github.com/apache/spark/pull/29574#issuecomment-683203442










[GitHub] [spark] AmplabJenkins removed a comment on pull request #29572: [WIP][SPARK-32730][SQL] Improve LeftSemi SortMergeJoin right side buffering

2020-08-28 Thread GitBox


AmplabJenkins removed a comment on pull request #29572:
URL: https://github.com/apache/spark/pull/29572#issuecomment-683203012










[GitHub] [spark] SparkQA commented on pull request #29574: Fix some R styles

2020-08-28 Thread GitBox


SparkQA commented on pull request #29574:
URL: https://github.com/apache/spark/pull/29574#issuecomment-683203194


   **[Test build #128010 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128010/testReport)** for PR 29574 at commit [`f1b46f3`](https://github.com/apache/spark/commit/f1b46f30d7c7d237c5c16a71ada3456c44089adc).






[GitHub] [spark] AmplabJenkins commented on pull request #29572: [WIP][SPARK-32730][SQL] Improve LeftSemi SortMergeJoin right side buffering

2020-08-28 Thread GitBox


AmplabJenkins commented on pull request #29572:
URL: https://github.com/apache/spark/pull/29572#issuecomment-683203012










[GitHub] [spark] SparkQA removed a comment on pull request #29572: [WIP][SPARK-32730][SQL] Improve LeftSemi SortMergeJoin right side buffering

2020-08-28 Thread GitBox


SparkQA removed a comment on pull request #29572:
URL: https://github.com/apache/spark/pull/29572#issuecomment-683129162


   **[Test build #128000 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128000/testReport)** for PR 29572 at commit [`1a49356`](https://github.com/apache/spark/commit/1a49356b8b7e021c1c8f7176e70b3d26ce8fc491).






[GitHub] [spark] SparkQA commented on pull request #29572: [WIP][SPARK-32730][SQL] Improve LeftSemi SortMergeJoin right side buffering

2020-08-28 Thread GitBox


SparkQA commented on pull request #29572:
URL: https://github.com/apache/spark/pull/29572#issuecomment-683202640


   **[Test build #128000 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128000/testReport)** for PR 29572 at commit [`1a49356`](https://github.com/apache/spark/commit/1a49356b8b7e021c1c8f7176e70b3d26ce8fc491).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `trait AppendOnlyUnsafeRowArray`






[GitHub] [spark] viirya commented on a change in pull request #29542: [SPARK-32703][SQL] Replace deprecated API calls from SpecificParquetRecordReaderBase

2020-08-28 Thread GitBox


viirya commented on a change in pull request #29542:
URL: https://github.com/apache/spark/pull/29542#discussion_r479583247



##
File path: 
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java
##
@@ -92,60 +88,24 @@
   public void initialize(InputSplit inputSplit, TaskAttemptContext taskAttemptContext)
   throws IOException, InterruptedException {
 Configuration configuration = taskAttemptContext.getConfiguration();
-ParquetInputSplit split = (ParquetInputSplit)inputSplit;
+FileSplit split = (FileSplit) inputSplit;
 this.file = split.getPath();
-long[] rowGroupOffsets = split.getRowGroupOffsets();
 
-ParquetMetadata footer;
-List blocks;
-
-// if task.side.metadata is set, rowGroupOffsets is null
-if (rowGroupOffsets == null) {
-  // then we need to apply the predicate push down filter
-  footer = readFooter(configuration, file, range(split.getStart(), split.getEnd()));
-  MessageType fileSchema = footer.getFileMetaData().getSchema();
-  FilterCompat.Filter filter = getFilter(configuration);

Review comment:
   Does this mean we don't have predicate push down anymore?








[GitHub] [spark] AmplabJenkins removed a comment on pull request #29576: [SPARK-32693][SQL][2.4] Compare two dataframes with same schema except nullable property

2020-08-28 Thread GitBox


AmplabJenkins removed a comment on pull request #29576:
URL: https://github.com/apache/spark/pull/29576#issuecomment-683200845










[GitHub] [spark] SparkQA commented on pull request #29576: [SPARK-32693][SQL][2.4] Compare two dataframes with same schema except nullable property

2020-08-28 Thread GitBox


SparkQA commented on pull request #29576:
URL: https://github.com/apache/spark/pull/29576#issuecomment-683201784


   **[Test build #128009 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128009/testReport)** for PR 29576 at commit [`d015266`](https://github.com/apache/spark/commit/d015266e1d275ff9cc2c1e75a11267079827bd3d).






[GitHub] [spark] maropu commented on pull request #29576: [SPARK-32693][SQL][2.4] Compare two dataframes with same schema except nullable property

2020-08-28 Thread GitBox


maropu commented on pull request #29576:
URL: https://github.com/apache/spark/pull/29576#issuecomment-683201606


   Thanks! LGTM






[GitHub] [spark] viirya commented on a change in pull request #29542: [SPARK-32703][SQL] Replace deprecated API calls from SpecificParquetRecordReaderBase

2020-08-28 Thread GitBox


viirya commented on a change in pull request #29542:
URL: https://github.com/apache/spark/pull/29542#discussion_r479582689



##
File path: 
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java
##
@@ -22,22 +22,20 @@
 import java.io.IOException;
 import java.lang.reflect.InvocationTargetException;
 import java.util.ArrayList;
-import java.util.Arrays;
 import java.util.Collections;
 import java.util.HashMap;
 import java.util.HashSet;
 import java.util.List;
 import java.util.Map;
 import java.util.Set;
 
+import org.apache.hadoop.mapred.FileSplit;
+import org.apache.parquet.HadoopReadOptions;
+import org.apache.parquet.ParquetReadOptions;
+import org.apache.parquet.hadoop.util.HadoopInputFile;
+import org.apache.spark.sql.internal.SQLConf;

Review comment:
   Could you move these imports below `scala.Option;` to the third-party 
import group?








[GitHub] [spark] AmplabJenkins commented on pull request #29576: [SPARK-32693][SQL][2.4] Compare two dataframes with same schema except nullable property

2020-08-28 Thread GitBox


AmplabJenkins commented on pull request #29576:
URL: https://github.com/apache/spark/pull/29576#issuecomment-683200845










[GitHub] [spark] viirya commented on pull request #29576: [SPARK-32693][SQL][2.4] Compare two dataframes with same schema except nullable property

2020-08-28 Thread GitBox


viirya commented on pull request #29576:
URL: https://github.com/apache/spark/pull/29576#issuecomment-683200720


   cc @maropu 






[GitHub] [spark] viirya opened a new pull request #29576: [SPARK-32693][SQL][2.4] Compare two dataframes with same schema except nullable property

2020-08-28 Thread GitBox


viirya opened a new pull request #29576:
URL: https://github.com/apache/spark/pull/29576


   
   
   ### What changes were proposed in this pull request?
   
   
   This PR changes the key data type check in `HashJoin` to use `sameType`. It 
backports #29555 to branch-2.4.
   
   ### Why are the changes needed?
   
   
   Looking at the resolution condition of `SetOperation`, it only requires 
that each left data type be `sameType` as the corresponding right one. 
Logically, the `EqualTo` expression in an equi-join likewise only requires the 
left data type to be `sameType` as the right data type. It therefore does not 
look reasonable for `HashJoin` to require the left key data types to be exactly 
the same as the right key data types.
   
   This leads to inconsistent results when doing `except` between two 
dataframes.
   
   If the two dataframes don't have nested fields, `HashJoin` passes the key 
type check even when the nullable property of their fields differs, because it 
checks each field individually, so the field's nullable property is ignored.
   
   If the two dataframes have nested fields such as a struct, `HashJoin` fails 
the key type check because it now compares two struct types, and the nullable 
property of the nested fields affects the comparison.
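
The difference between the two cases can be illustrated with a small, self-contained model. This is only a sketch and not Spark's code: `Atomic`, `Field`, `Struct`, `same_type`, `strict_keys_match` and `relaxed_keys_match` are hypothetical names, and `same_type` assumes that a `sameType`-style comparison ignores the nullable flag of nested fields, as the description above implies.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Atomic:
    """Hypothetical stand-in for a primitive type such as int or string."""
    name: str


@dataclass(frozen=True)
class Field:
    """Hypothetical stand-in for a struct field carrying a nullable flag."""
    name: str
    data_type: "DataType"
    nullable: bool


@dataclass(frozen=True)
class Struct:
    """Hypothetical stand-in for a struct type."""
    fields: Tuple[Field, ...]


DataType = Union[Atomic, Struct]


def same_type(left: DataType, right: DataType) -> bool:
    """Compare two types while ignoring the nullable flag of nested fields."""
    if isinstance(left, Struct) and isinstance(right, Struct):
        return len(left.fields) == len(right.fields) and all(
            l.name == r.name and same_type(l.data_type, r.data_type)
            for l, r in zip(left.fields, right.fields)
        )
    return left == right


def strict_keys_match(left_keys, right_keys) -> bool:
    """Exact equality: nullable flags inside structs take part in the check."""
    return list(left_keys) == list(right_keys)


def relaxed_keys_match(left_keys, right_keys) -> bool:
    """Relaxed check: every key pair only has to satisfy same_type."""
    return len(left_keys) == len(right_keys) and all(
        same_type(l, r) for l, r in zip(left_keys, right_keys)
    )


INT = Atomic("int")
# Two struct key types that differ only in the nullable flag of field "a".
nullable_struct = Struct((Field("a", INT, True),))
non_nullable_struct = Struct((Field("a", INT, False),))

# Flat keys: the key type is just "int", so the exact check already passes.
flat_strict = strict_keys_match([INT], [INT])
# Nested keys: the exact check fails, the relaxed check passes.
nested_strict = strict_keys_match([nullable_struct], [non_nullable_struct])
nested_relaxed = relaxed_keys_match([nullable_struct], [non_nullable_struct])
```

In this model a flat key passes either check because the nullable flag is not part of the key's type, while a struct key only passes the relaxed check, which mirrors the inconsistency described above.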
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   Yes. This makes the `except` operation between dataframes behave consistently.
   
   ### How was this patch tested?
   
   
   Unit test.






[GitHub] [spark] github-actions[bot] closed pull request #28106: [SPARK-31335][SQL] Add try function support

2020-08-28 Thread GitBox


github-actions[bot] closed pull request #28106:
URL: https://github.com/apache/spark/pull/28106


   






[GitHub] [spark] github-actions[bot] commented on pull request #28491: [SPARK-30267][SQL] Interoperability tests with Avro records generated by Avro4s

2020-08-28 Thread GitBox


github-actions[bot] commented on pull request #28491:
URL: https://github.com/apache/spark/pull/28491#issuecomment-683200263


   We're closing this PR because it hasn't been updated in a while. This isn't 
a judgement on the merit of the PR in any way. It's just a way of keeping the 
PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to 
remove the Stale tag!






[GitHub] [spark] github-actions[bot] commented on pull request #28574: [SPARK-31752][SQL][DOCS] Add sql doc for interval type

2020-08-28 Thread GitBox


github-actions[bot] commented on pull request #28574:
URL: https://github.com/apache/spark/pull/28574#issuecomment-683200258


   We're closing this PR because it hasn't been updated in a while. This isn't 
a judgement on the merit of the PR in any way. It's just a way of keeping the 
PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to 
remove the Stale tag!






[GitHub] [spark] github-actions[bot] closed pull request #27498: [SPARK-30688][SQL] Week based dates not being parsed with TimestampFormatter

2020-08-28 Thread GitBox


github-actions[bot] closed pull request #27498:
URL: https://github.com/apache/spark/pull/27498


   






[GitHub] [spark] AmplabJenkins removed a comment on pull request #29575: [SPARK-32693][SQL][3.0] Compare two dataframes with same schema except nullable property

2020-08-28 Thread GitBox


AmplabJenkins removed a comment on pull request #29575:
URL: https://github.com/apache/spark/pull/29575#issuecomment-683199099










[GitHub] [spark] AmplabJenkins commented on pull request #29575: [SPARK-32693][SQL][3.0] Compare two dataframes with same schema except nullable property

2020-08-28 Thread GitBox


AmplabJenkins commented on pull request #29575:
URL: https://github.com/apache/spark/pull/29575#issuecomment-683199099










[GitHub] [spark] SparkQA commented on pull request #29575: [SPARK-32693][SQL][3.0] Compare two dataframes with same schema except nullable property

2020-08-28 Thread GitBox


SparkQA commented on pull request #29575:
URL: https://github.com/apache/spark/pull/29575#issuecomment-683198820


   **[Test build #128008 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128008/testReport)** for PR 29575 at commit [`141c8f3`](https://github.com/apache/spark/commit/141c8f3eecea97ecc75ee02806566a9f29f41af8).






[GitHub] [spark] viirya commented on pull request #29575: [SPARK-32693][SQL][3.0] Compare two dataframes with same schema except nullable property

2020-08-28 Thread GitBox


viirya commented on pull request #29575:
URL: https://github.com/apache/spark/pull/29575#issuecomment-683198529


   cc @maropu 






[GitHub] [spark] viirya opened a new pull request #29575: [SPARK-32693][SQL][3.0] Compare two dataframes with same schema except nullable property

2020-08-28 Thread GitBox


viirya opened a new pull request #29575:
URL: https://github.com/apache/spark/pull/29575


   
   
   ### What changes were proposed in this pull request?
   
   
   This PR changes the key data type check in `HashJoin` to use `sameType`. 
This backports #29555 to branch-3.0.
   
   ### Why are the changes needed?
   
   
   Looking at the resolution condition of `SetOperation`, it only requires 
that each left data type be `sameType` as the corresponding right one. 
Logically, the `EqualTo` expression in an equi-join likewise only requires the 
left data type to be `sameType` as the right data type. It therefore does not 
look reasonable for `HashJoin` to require the left key data types to be exactly 
the same as the right key data types.
   
   This leads to inconsistent results when doing `except` between two 
dataframes.
   
   If the two dataframes don't have nested fields, `HashJoin` passes the key 
type check even when the nullable property of their fields differs, because it 
checks each field individually, so the field's nullable property is ignored.
   
   If the two dataframes have nested fields such as a struct, `HashJoin` fails 
the key type check because it now compares two struct types, and the nullable 
property of the nested fields affects the comparison.
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   Yes. This makes the `except` operation between dataframes behave consistently.
   
   ### How was this patch tested?
   
   
   Unit test.






[GitHub] [spark] AmplabJenkins removed a comment on pull request #29574: Fix some R styles

2020-08-28 Thread GitBox


AmplabJenkins removed a comment on pull request #29574:
URL: https://github.com/apache/spark/pull/29574#issuecomment-683197456










[GitHub] [spark] AmplabJenkins commented on pull request #29574: Fix some R styles

2020-08-28 Thread GitBox


AmplabJenkins commented on pull request #29574:
URL: https://github.com/apache/spark/pull/29574#issuecomment-683197456










[GitHub] [spark] SparkQA commented on pull request #29574: Fix some R styles

2020-08-28 Thread GitBox


SparkQA commented on pull request #29574:
URL: https://github.com/apache/spark/pull/29574#issuecomment-683197133


   **[Test build #128007 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128007/testReport)** for PR 29574 at commit [`85fa83a`](https://github.com/apache/spark/commit/85fa83ad978fd0ae2b757fd2d72272dc54e3089b).






[GitHub] [spark] AmplabJenkins removed a comment on pull request #29574: Fix some R styles

2020-08-28 Thread GitBox


AmplabJenkins removed a comment on pull request #29574:
URL: https://github.com/apache/spark/pull/29574#issuecomment-683196705










[GitHub] [spark] AmplabJenkins commented on pull request #29574: Fix some R styles

2020-08-28 Thread GitBox


AmplabJenkins commented on pull request #29574:
URL: https://github.com/apache/spark/pull/29574#issuecomment-683196705










[GitHub] [spark] SparkQA commented on pull request #29574: Fix some R styles

2020-08-28 Thread GitBox


SparkQA commented on pull request #29574:
URL: https://github.com/apache/spark/pull/29574#issuecomment-683196636


   **[Test build #128006 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128006/testReport)** for PR 29574 at commit [`ff3d769`](https://github.com/apache/spark/commit/ff3d76981285f5f752b46617f7a579e6ce85c7fe).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.






[GitHub] [spark] SparkQA removed a comment on pull request #29574: Fix some R styles

2020-08-28 Thread GitBox


SparkQA removed a comment on pull request #29574:
URL: https://github.com/apache/spark/pull/29574#issuecomment-683188721


   **[Test build #128006 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128006/testReport)** for PR 29574 at commit [`ff3d769`](https://github.com/apache/spark/commit/ff3d76981285f5f752b46617f7a579e6ce85c7fe).





