[GitHub] [spark] SparkQA commented on pull request #33748: [SPARK-36516][SQL] Support File Metadata Cache for ORC

2021-08-18 Thread GitBox


SparkQA commented on pull request #33748:
URL: https://github.com/apache/spark/pull/33748#issuecomment-901630384


   **[Test build #142639 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142639/testReport)** for PR 33748 at commit [`e5f9497`](https://github.com/apache/spark/commit/e5f94971a4af723440e4cac4fa4bfbe6fff018b9).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gengliangwang commented on pull request #33749: [SPARK-36519][SS]Store RocksDB format version in the checkpoint for streaming queries

2021-08-18 Thread GitBox


gengliangwang commented on pull request #33749:
URL: https://github.com/apache/spark/pull/33749#issuecomment-901630319


   I will cut 3.2.0 RC1 after this one is merged. 
   cc @viirya 





[GitHub] [spark] itholic commented on a change in pull request #33752: [SPARK-36401][PYTHON] Implement Series.cov

2021-08-18 Thread GitBox


itholic commented on a change in pull request #33752:
URL: https://github.com/apache/spark/pull/33752#discussion_r691803163



##
File path: python/pyspark/pandas/series.py
##
@@ -944,6 +944,57 @@ def between(self, left: Any, right: Any, inclusive: bool = True) -> "Series":
 
         return lmask & rmask
 
+    def cov(self, other: "Series", min_periods: int = 1) -> float:
+        """
+        Compute covariance with Series, excluding missing values.
+        Parameters
+        ----------
+        other : Series
+            Series with which to compute the covariance.
+        min_periods : int, default 1
+            Minimum number of observations needed to have a valid result. None = 1.
+
+        Returns
+        -------
+        float
+            Covariance between Series and other
+
+        Examples
+        --------
+        >>> from pyspark.pandas.config import set_option, reset_option
+        >>> set_option("compute.ops_on_diff_frames", True)
+        >>> s1 = ps.Series([0.90010907, 0.13484424, 0.62036035])
+        >>> s2 = ps.Series([0.12528585, 0.26962463, 0.51111198])
+        >>> s1.cov(s2)
+        -0.016857626527158744
+        >>> reset_option("compute.ops_on_diff_frames")
+        """
+
+        if min_periods is None:
+            min_periods = 1
+
+        if same_anchor(self, other):
+            self_column_label = verify_temp_column_name(other.to_frame(), "__self_column__")
+            other_column_label = verify_temp_column_name(self.to_frame(), "__other_column__")
+            combined = DataFrame(
+                self._internal.with_new_columns(
+                    [self.rename(self_column_label), other.rename(other_column_label)]
+                )
+            )

Review comment:
   AFAIK, `count` also collects the data on each node and accumulates it.
   
   Generally, `head` is faster if you don't need to scan the entire data.
   
   `count` always requires scanning all the data.
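
To illustrate the point above, here is a minimal, Spark-free sketch in plain Python of why a `head`-style check can stop early while a `count`-style check cannot. The names `tracked_rows`, `has_at_least`, and `count_all` are hypothetical and are not part of the PR or of pyspark; the lazy generator merely stands in for a dataset that is read row by row.

```python
from itertools import islice


def tracked_rows(n, consumed):
    """Yield n rows lazily, recording every row that is actually read."""
    for i in range(n):
        consumed.append(i)
        yield i


def has_at_least(rows, k):
    """head-style check (hypothetical helper): pull at most k rows, then stop."""
    return len(list(islice(rows, k))) >= k


def count_all(rows):
    """count-style check (hypothetical helper): has to visit every row."""
    return sum(1 for _ in rows)


head_reads, count_reads = [], []

# Both checks answer "are there at least 3 rows?" on a 1000-row source.
enough_head = has_at_least(tracked_rows(1000, head_reads), 3)
enough_count = count_all(tracked_rows(1000, count_reads)) >= 3

print(enough_head, len(head_reads))
print(enough_count, len(count_reads))
```

In this sketch the `head`-style check reads 3 of the 1000 rows, whereas the `count`-style check reads all of them; both reach the same answer.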







[GitHub] [spark] SparkQA commented on pull request #33744: [SPARK-36403][PYTHON] Implement `Index.putmask`

2021-08-18 Thread GitBox


SparkQA commented on pull request #33744:
URL: https://github.com/apache/spark/pull/33744#issuecomment-901628569


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47137/
   





[GitHub] [spark] HyukjinKwon commented on pull request #33769: [SPARK-36536][SQL] Use CAST for datetime in CSV/JSON by default

2021-08-18 Thread GitBox


HyukjinKwon commented on pull request #33769:
URL: https://github.com/apache/spark/pull/33769#issuecomment-901627463


   Should we maybe update the SQL migration guide?





[GitHub] [spark] HyukjinKwon commented on pull request #33784: Revert "[SPARK-34309][BUILD][CORE][SQL][K8S] Use Caffeine instead of Guava Cache"

2021-08-18 Thread GitBox


HyukjinKwon commented on pull request #33784:
URL: https://github.com/apache/spark/pull/33784#issuecomment-901625868


   hm I was thinking that we'd keep that fix since that's already merged.





[GitHub] [spark] SparkQA commented on pull request #33650: [SPARK-36351][SQL] Refactor filter push down in file source v2

2021-08-18 Thread GitBox


SparkQA commented on pull request #33650:
URL: https://github.com/apache/spark/pull/33650#issuecomment-901624380


   **[Test build #142638 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142638/testReport)** for PR 33650 at commit [`8907dea`](https://github.com/apache/spark/commit/8907deaf39e5b0c954f6ad4750a0c63cf5fc72e7).





[GitHub] [spark] SparkQA commented on pull request #33784: Revert "[SPARK-34309][BUILD][CORE][SQL][K8S] Use Caffeine instead of Guava Cache"

2021-08-18 Thread GitBox


SparkQA commented on pull request #33784:
URL: https://github.com/apache/spark/pull/33784#issuecomment-901624206


   **[Test build #142637 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142637/testReport)** for PR 33784 at commit [`756c905`](https://github.com/apache/spark/commit/756c905dd2d0d6bba2acb8e468320dda90fd73c3).





[GitHub] [spark] sunchao commented on a change in pull request #33650: [SPARK-36351][SQL] Refactor filter push down in file source v2

2021-08-18 Thread GitBox


sunchao commented on a change in pull request #33650:
URL: https://github.com/apache/spark/pull/33650#discussion_r691797436



##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileScanBuilder.scala
##
@@ -57,6 +64,27 @@ abstract class FileScanBuilder(
     StructType(fields)
   }
 
+  def pushFilters(partitionFilters: Seq[Expression], dataFilters: Seq[Expression]): Unit = {
+    this.partitionFilters = partitionFilters
+    this.dataFilters = dataFilters
+    this.pushedDataFilters = pushDataFilters(dataFilters)

Review comment:
   I feel the same way, and this can still allow us to have file data sources implementing `SupportsPushDownFilters`, correct? It feels weird that built-in data sources don't use the V2 API.







[GitHub] [spark] LuciferYang commented on a change in pull request #30483: [SPARK-33449][SQL] Support File Metadata Cache for Parquet

2021-08-18 Thread GitBox


LuciferYang commented on a change in pull request #30483:
URL: https://github.com/apache/spark/pull/30483#discussion_r691796638



##
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##
@@ -967,6 +967,20 @@ object SQLConf {
     .booleanConf
     .createWithDefault(false)
 
+  val FILE_META_CACHE_PARQUET_ENABLED = buildConf("spark.sql.fileMetaCache.parquet.enabled")
+    .doc("To indicate if enable parquet file meta cache, it is recommended to enabled " +
+      "this config when multiple queries are performed on the same dataset, default is false.")
+    .version("3.3.0")
+    .booleanConf
+    .createWithDefault(false)
+
+  val FILE_META_CACHE_TTL_SINCE_LAST_ACCESS =

Review comment:
   good suggestion
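
For readers unfamiliar with the idea behind the config name in the hunk above, a time-to-live measured since last access can be sketched as follows. This is a plain-Python illustration under stated assumptions, not the cache used in the PR: `AccessTtlCache`, its `loader` callback, and the injected `clock` are all hypothetical names, and the sketch does not proactively evict idle entries.

```python
class AccessTtlCache:
    """Hypothetical cache whose entries expire after being idle for ttl_seconds."""

    def __init__(self, ttl_seconds, clock):
        self._ttl = ttl_seconds
        self._clock = clock          # injected so the example is deterministic
        self._entries = {}           # key -> (value, last_access_time)

    def get(self, key, loader):
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[1] <= self._ttl:
            value = hit[0]           # still fresh: reuse the cached value
        else:
            value = loader(key)      # missing or idle too long: reload
        # Every access, hit or miss, resets the idle timer.
        self._entries[key] = (value, now)
        return value


now = [0.0]
loads = []


def load_footer(path):
    """Stand-in for an expensive metadata read; records each real load."""
    loads.append(path)
    return "footer of " + path


cache = AccessTtlCache(ttl_seconds=10.0, clock=lambda: now[0])
cache.get("a.parquet", load_footer)   # t=0: miss, loads
now[0] = 5.0
cache.get("a.parquet", load_footer)   # t=5: hit, last access becomes 5
now[0] = 14.0
cache.get("a.parquet", load_footer)   # t=14: hit, only 9s idle
now[0] = 30.0
cache.get("a.parquet", load_footer)   # t=30: 16s idle, reloads
print(len(loads))
```

Because the timer restarts on each access, repeated queries over the same files keep their entries alive, which matches the use case the config's doc string describes (multiple queries on the same dataset).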







[GitHub] [spark] AmplabJenkins removed a comment on pull request #33650: [SPARK-36351][SQL] Refactor filter push down in file source v2

2021-08-18 Thread GitBox


AmplabJenkins removed a comment on pull request #33650:
URL: https://github.com/apache/spark/pull/33650#issuecomment-901622894









[GitHub] [spark] AmplabJenkins removed a comment on pull request #33664: [SPARK-36444][SQL] Remove OptimizeSubqueries from batch of PartitionPruning

2021-08-18 Thread GitBox


AmplabJenkins removed a comment on pull request #33664:
URL: https://github.com/apache/spark/pull/33664#issuecomment-901622891


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47133/
   





[GitHub] [spark] AmplabJenkins removed a comment on pull request #33744: [SPARK-36403][PYTHON] Implement `Index.putmask`

2021-08-18 Thread GitBox


AmplabJenkins removed a comment on pull request #33744:
URL: https://github.com/apache/spark/pull/33744#issuecomment-901622889


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/142636/
   





[GitHub] [spark] AmplabJenkins commented on pull request #33744: [SPARK-36403][PYTHON] Implement `Index.putmask`

2021-08-18 Thread GitBox


AmplabJenkins commented on pull request #33744:
URL: https://github.com/apache/spark/pull/33744#issuecomment-901622889


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/142636/
   





[GitHub] [spark] AmplabJenkins commented on pull request #33664: [SPARK-36444][SQL] Remove OptimizeSubqueries from batch of PartitionPruning

2021-08-18 Thread GitBox


AmplabJenkins commented on pull request #33664:
URL: https://github.com/apache/spark/pull/33664#issuecomment-901622891


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47133/
   





[GitHub] [spark] AmplabJenkins commented on pull request #33650: [SPARK-36351][SQL] Refactor filter push down in file source v2

2021-08-18 Thread GitBox


AmplabJenkins commented on pull request #33650:
URL: https://github.com/apache/spark/pull/33650#issuecomment-901622897









[GitHub] [spark] SparkQA commented on pull request #33748: [SPARK-36516][SQL] Support File Metadata Cache for ORC

2021-08-18 Thread GitBox


SparkQA commented on pull request #33748:
URL: https://github.com/apache/spark/pull/33748#issuecomment-901622111


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47136/
   





[GitHub] [spark] LuciferYang commented on pull request #33784: Revert "[SPARK-34309][BUILD][CORE][SQL][K8S] Use Caffeine instead of Guava Cache"

2021-08-18 Thread GitBox


LuciferYang commented on pull request #33784:
URL: https://github.com/apache/spark/pull/33784#issuecomment-901621745


   cc @gatorsmile @HyukjinKwon @zsxwing @sarutak @sunchao @holdenk @mridulm @dongjoon-hyun
   `RemoteBlockPushResolver` had some conflicts, which I resolved manually. Please help review ~





[GitHub] [spark] gengliangwang closed pull request #33741: [SPARK-36512][UI][TESTS] Fix UISeleniumSuite in sql/hive-thriftserver

2021-08-18 Thread GitBox


gengliangwang closed pull request #33741:
URL: https://github.com/apache/spark/pull/33741


   





[GitHub] [spark] gengliangwang commented on pull request #33741: [SPARK-36512][UI][TESTS] Fix UISeleniumSuite in sql/hive-thriftserver

2021-08-18 Thread GitBox


gengliangwang commented on pull request #33741:
URL: https://github.com/apache/spark/pull/33741#issuecomment-901621425


   Thanks, merging to master





[GitHub] [spark] LuciferYang opened a new pull request #33784: Revert "[SPARK-34309][BUILD][CORE][SQL][K8S] Use Caffeine instead of Guava Cache"

2021-08-18 Thread GitBox


LuciferYang opened a new pull request #33784:
URL: https://github.com/apache/spark/pull/33784


   ### What changes were proposed in this pull request?
   This PR reverts the changes of SPARK-34309, which include:
   
   - https://github.com/apache/spark/pull/31517
   - https://github.com/apache/spark/pull/33772
   
   
   ### Why are the changes needed?
   
   1. No real performance improvement in Spark
   2. Added an additional dependency 
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   Pass Jenkins or GitHub Actions





[GitHub] [spark] SparkQA removed a comment on pull request #33650: [SPARK-36351][SQL] Refactor filter push down in file source v2

2021-08-18 Thread GitBox


SparkQA removed a comment on pull request #33650:
URL: https://github.com/apache/spark/pull/33650#issuecomment-901519101


   **[Test build #142630 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142630/testReport)** for PR 33650 at commit [`2cadefb`](https://github.com/apache/spark/commit/2cadefbd4a42e73578634ecb3c04ebe828f48530).





[GitHub] [spark] SparkQA removed a comment on pull request #33744: [SPARK-36403][PYTHON] Implement `Index.putmask`

2021-08-18 Thread GitBox


SparkQA removed a comment on pull request #33744:
URL: https://github.com/apache/spark/pull/33744#issuecomment-901604941


   **[Test build #142636 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142636/testReport)** for PR 33744 at commit [`ac455de`](https://github.com/apache/spark/commit/ac455de560e4e7be1a472abbc0aa1a9907cbdd1a).





[GitHub] [spark] SparkQA commented on pull request #33650: [SPARK-36351][SQL] Refactor filter push down in file source v2

2021-08-18 Thread GitBox


SparkQA commented on pull request #33650:
URL: https://github.com/apache/spark/pull/33650#issuecomment-901619291


   Kubernetes integration test unable to build dist.
   
   exiting with code: 1
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47135/
   





[GitHub] [spark] SparkQA commented on pull request #33748: [SPARK-36516][SQL] Support File Metadata Cache for ORC

2021-08-18 Thread GitBox


SparkQA commented on pull request #33748:
URL: https://github.com/apache/spark/pull/33748#issuecomment-901618376


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47134/
   





[GitHub] [spark] SparkQA commented on pull request #33744: [SPARK-36403][PYTHON] Implement `Index.putmask`

2021-08-18 Thread GitBox


SparkQA commented on pull request #33744:
URL: https://github.com/apache/spark/pull/33744#issuecomment-901614541


   **[Test build #142636 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142636/testReport)** for PR 33744 at commit [`ac455de`](https://github.com/apache/spark/commit/ac455de560e4e7be1a472abbc0aa1a9907cbdd1a).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.





[GitHub] [spark] SparkQA commented on pull request #33664: [SPARK-36444][SQL] Remove OptimizeSubqueries from batch of PartitionPruning

2021-08-18 Thread GitBox


SparkQA commented on pull request #33664:
URL: https://github.com/apache/spark/pull/33664#issuecomment-901612824


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47133/
   





[GitHub] [spark] dgd-contributor commented on pull request #33536: [SPARK-36101][CORE] Grouping exception in core/api

2021-08-18 Thread GitBox


dgd-contributor commented on pull request #33536:
URL: https://github.com/apache/spark/pull/33536#issuecomment-901612747


   @cloud-fan Hi, can you review this?





[GitHub] [spark] SparkQA commented on pull request #33650: [SPARK-36351][SQL] Refactor filter push down in file source v2

2021-08-18 Thread GitBox


SparkQA commented on pull request #33650:
URL: https://github.com/apache/spark/pull/33650#issuecomment-901611530


   **[Test build #142630 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142630/testReport)** for PR 33650 at commit [`2cadefb`](https://github.com/apache/spark/commit/2cadefbd4a42e73578634ecb3c04ebe828f48530).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.





[GitHub] [spark] sarutak commented on pull request #33783: [MINOR][DOCS] Mention Hadoop 3 in YARN introduction on cluster-overview.md

2021-08-18 Thread GitBox


sarutak commented on pull request #33783:
URL: https://github.com/apache/spark/pull/33783#issuecomment-901609919


   Are there any similar places that we should modify in bulk?





[GitHub] [spark] cloud-fan commented on a change in pull request #33650: [SPARK-36351][SQL] Refactor filter push down in file source v2

2021-08-18 Thread GitBox


cloud-fan commented on a change in pull request #33650:
URL: https://github.com/apache/spark/pull/33650#discussion_r691778707



##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceUtils.scala
##
@@ -242,4 +244,21 @@ object DataSourceUtils {
       options
     }
   }
+
+  def getPartitionKeyFiltersAndDataFilters(
+      sparkSession: SparkSession,

Review comment:
   it seems we only need `conf: SQLConf`







[GitHub] [spark] cloud-fan commented on a change in pull request #33650: [SPARK-36351][SQL] Refactor filter push down in file source v2

2021-08-18 Thread GitBox


cloud-fan commented on a change in pull request #33650:
URL: https://github.com/apache/spark/pull/33650#discussion_r691778075



##
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala
##
@@ -2972,37 +2970,6 @@ class JsonV2Suite extends JsonSuite {
     super
       .sparkConf
       .set(SQLConf.USE_V1_SOURCE_LIST, "")
-
-  test("get pushed filters") {

Review comment:
   can we rewrite this test and make it work?







[GitHub] [spark] cloud-fan commented on a change in pull request #33650: [SPARK-36351][SQL] Refactor filter push down in file source v2

2021-08-18 Thread GitBox


cloud-fan commented on a change in pull request #33650:
URL: https://github.com/apache/spark/pull/33650#discussion_r691777367



##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileScanBuilder.scala
##
@@ -57,6 +64,27 @@ abstract class FileScanBuilder(
     StructType(fields)
   }
 
+  def pushFilters(partitionFilters: Seq[Expression], dataFilters: Seq[Expression]): Unit = {
+    this.partitionFilters = partitionFilters
+    this.dataFilters = dataFilters
+    this.pushedDataFilters = pushDataFilters(dataFilters)

Review comment:
   We can translate the data filters here before passing them to `pushDataFilters`







[GitHub] [spark] cloud-fan commented on a change in pull request #33650: [SPARK-36351][SQL] Refactor filter push down in file source v2

2021-08-18 Thread GitBox


cloud-fan commented on a change in pull request #33650:
URL: https://github.com/apache/spark/pull/33650#discussion_r691776887



##
File path: 
external/avro/src/main/scala/org/apache/spark/sql/v2/avro/AvroScanBuilder.scala
##
@@ -41,17 +42,16 @@ class AvroScanBuilder (
   readDataSchema(),
   readPartitionSchema(),
   options,
-  pushedFilters())
+  pushedDataFilters,
+  partitionFilters,
+  dataFilters)
   }
 
-  private var _pushedFilters: Array[Filter] = Array.empty
-
-  override def pushFilters(filters: Array[Filter]): Array[Filter] = {
+  override def pushDataFilters(dataFilters: Seq[Expression]): Array[Filter] = {

Review comment:
   nit: the input parameter can be `dataFilters: Array[Filter]`; then we 
don't need to ask every source impl to call `translateDataFilter`.
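
The suggestion above (translate once in the base builder, so each source only receives already-translated filters) can be sketched in a language-neutral way. The following is a minimal Python illustration, not the Spark code: `Expression`, `Filter`, `translate_data_filter`, the `translatable` flag, and `AvroLikeScanBuilder` are hypothetical stand-ins introduced only for this sketch.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class Expression:
    """Stand-in for a catalyst expression; `translatable` is a made-up flag."""
    sql: str
    translatable: bool = True


@dataclass(frozen=True)
class Filter:
    """Stand-in for a data source filter."""
    sql: str


def translate_data_filter(expr: Expression) -> Optional[Filter]:
    # Hypothetical translation: None means no source-level filter exists.
    return Filter(expr.sql) if expr.translatable else None


class FileScanBuilder:
    def __init__(self) -> None:
        self.partition_filters: List[Expression] = []
        self.data_filters: List[Expression] = []
        self.pushed_data_filters: List[Filter] = []

    def push_filters(
        self,
        partition_filters: Sequence[Expression],
        data_filters: Sequence[Expression],
    ) -> None:
        self.partition_filters = list(partition_filters)
        self.data_filters = list(data_filters)
        # Translate once here, so subclasses only ever see Filter objects.
        translated = [
            f for f in map(translate_data_filter, data_filters) if f is not None
        ]
        self.pushed_data_filters = self.push_data_filters(translated)

    def push_data_filters(self, data_filters: List[Filter]) -> List[Filter]:
        # Default for sources without filter push-down: push nothing.
        return []


class AvroLikeScanBuilder(FileScanBuilder):
    def push_data_filters(self, data_filters: List[Filter]) -> List[Filter]:
        # This hypothetical source accepts every translated filter.
        return list(data_filters)
```

With this shape, a source implementation overrides only `push_data_filters` and never calls the translation helper itself; untranslatable expressions stay in `data_filters` but are not pushed.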







[GitHub] [spark] SparkQA commented on pull request #33744: [SPARK-36403][PYTHON] Implement `Index.putmask`

2021-08-18 Thread GitBox


SparkQA commented on pull request #33744:
URL: https://github.com/apache/spark/pull/33744#issuecomment-901604941


   **[Test build #142636 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142636/testReport)**
 for PR 33744 at commit 
[`ac455de`](https://github.com/apache/spark/commit/ac455de560e4e7be1a472abbc0aa1a9907cbdd1a).





[GitHub] [spark] SparkQA commented on pull request #33748: [SPARK-36516][SQL] Support File Metadata Cache for ORC

2021-08-18 Thread GitBox


SparkQA commented on pull request #33748:
URL: https://github.com/apache/spark/pull/33748#issuecomment-901603993


   **[Test build #142634 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142634/testReport)**
 for PR 33748 at commit 
[`4adeb62`](https://github.com/apache/spark/commit/4adeb628062bfc091041df845d1c2b9bd7515954).





[GitHub] [spark] sunchao commented on a change in pull request #30483: [SPARK-33449][SQL] Support File Metadata Cache for Parquet

2021-08-18 Thread GitBox


sunchao commented on a change in pull request #30483:
URL: https://github.com/apache/spark/pull/30483#discussion_r691697807



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##
@@ -967,6 +967,20 @@ object SQLConf {
 .booleanConf
 .createWithDefault(false)
 
+  val FILE_META_CACHE_PARQUET_ENABLED = 
buildConf("spark.sql.fileMetaCache.parquet.enabled")
+.doc("To indicate if enable parquet file meta cache, it is recommended to 
enabled " +

Review comment:
   Hmm, I'm curious whether this can help if your Spark queries run as 
separate Spark jobs, each of which may use different executors.

##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##
@@ -967,6 +967,20 @@ object SQLConf {
 .booleanConf
 .createWithDefault(false)
 
+  val FILE_META_CACHE_PARQUET_ENABLED = 
buildConf("spark.sql.fileMetaCache.parquet.enabled")
+.doc("To indicate if enable parquet file meta cache, it is recommended to 
enabled " +
+  "this config when multiple queries are performed on the same dataset, 
default is false.")
+.version("3.3.0")
+.booleanConf
+.createWithDefault(false)
+
+  val FILE_META_CACHE_TTL_SINCE_LAST_ACCESS =

Review comment:
   nit: maybe `FILE_META_CACHE_TTL_SINCE_LAST_ACCESS_SEC` and 
`spark.sql.fileMetaCache.ttlSinceLastAccessSec`, so it's clear that the 
unit is seconds?

##
File path: 
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java
##
@@ -77,28 +82,31 @@
 
   protected ParquetFileReader reader;
 
+  protected ParquetMetadata cachedFooter;
+
   @Override
   public void initialize(InputSplit inputSplit, TaskAttemptContext 
taskAttemptContext)
   throws IOException, InterruptedException {
 Configuration configuration = taskAttemptContext.getConfiguration();
 FileSplit split = (FileSplit) inputSplit;
 this.file = split.getPath();
 
-ParquetReadOptions options = HadoopReadOptions
-  .builder(configuration)
-  .withRange(split.getStart(), split.getStart() + split.getLength())
-  .build();
-this.reader = new ParquetFileReader(HadoopInputFile.fromPath(file, 
configuration), options);
-this.fileSchema = reader.getFileMetaData().getSchema();
-Map<String, String> fileMetadata = reader.getFileMetaData().getKeyValueMetaData();
+ParquetMetadata footer =
+  readFooterByRange(configuration, split.getStart(), split.getStart() + 
split.getLength());
+this.fileSchema = footer.getFileMetaData().getSchema();
+FilterCompat.Filter filter = ParquetInputFormat.getFilter(configuration);
+List<BlockMetaData> blocks =
+  RowGroupFilter.filterRowGroups(filter, footer.getBlocks(), fileSchema);

Review comment:
   Does this apply all the filter levels, e.g., stats, dictionary, and 
bloom filter?

##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileMetaCacheManager.scala
##
@@ -0,0 +1,87 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import java.util.concurrent.TimeUnit
+
+import com.github.benmanes.caffeine.cache.{CacheLoader, Caffeine}
+import com.github.benmanes.caffeine.cache.stats.CacheStats
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkEnv
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.internal.SQLConf
+
+/**
+ * A singleton Cache Manager to caching file meta. We cache these file metas 
in order to speed up
+ * iterated queries over the same dataset. Otherwise, each query would have to 
hit remote storage
+ * in order to fetch file meta before read files.
+ *
+ * We should implement the corresponding `FileMetaKey` for a specific file 
format, for example
+ * `ParquetFileMetaKey` or `OrcFileMetaKey`. By default, the file path is used 
as the identification
+ * of the `FileMetaKey` and the `getFileMeta` method of `FileMetaKey` is used 
to return the file
+ * meta of the corresponding file format.
+ */
+object FileMetaCacheManager extends Logging {
+
+  private lazy val cacheLoader = new 

[GitHub] [spark] SparkQA commented on pull request #33650: [SPARK-36351][SQL] Refactor filter push down in file source v2

2021-08-18 Thread GitBox


SparkQA commented on pull request #33650:
URL: https://github.com/apache/spark/pull/33650#issuecomment-901603315


   **[Test build #142635 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142635/testReport)**
 for PR 33650 at commit 
[`ae98a11`](https://github.com/apache/spark/commit/ae98aaa6a0f14ce86faf313779728f51ecab).





[GitHub] [spark] huaxingao commented on a change in pull request #33650: [SPARK-36351][SQL] Refactor filter push down in file source v2

2021-08-18 Thread GitBox


huaxingao commented on a change in pull request #33650:
URL: https://github.com/apache/spark/pull/33650#discussion_r691773116



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileScanBuilder.scala
##
@@ -57,6 +63,30 @@ abstract class FileScanBuilder(
 StructType(fields)
   }
 
+  def pushFiltersToFileIndex(

Review comment:
   Sounds good. Changed.







[GitHub] [spark] SparkQA commented on pull request #33664: [SPARK-36444][SQL] Remove OptimizeSubqueries from batch of PartitionPruning

2021-08-18 Thread GitBox


SparkQA commented on pull request #33664:
URL: https://github.com/apache/spark/pull/33664#issuecomment-901600120


   Kubernetes integration test starting
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47133/
   





[GitHub] [spark] LuciferYang commented on a change in pull request #33748: [SPARK-36516][SQL] Support File Metadata Cache for ORC

2021-08-18 Thread GitBox


LuciferYang commented on a change in pull request #33748:
URL: https://github.com/apache/spark/pull/33748#discussion_r691768337



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileMetaCacheManager.scala
##
@@ -0,0 +1,95 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import java.util.concurrent.TimeUnit
+
+import com.github.benmanes.caffeine.cache.{CacheLoader, Caffeine}
+import com.github.benmanes.caffeine.cache.stats.CacheStats
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkEnv
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.internal.SQLConf
+
+/**
+ * A singleton Cache Manager to caching file meta. We cache these file metas 
in order to speed up
+ * iterated queries over the same dataset. Otherwise, each query would have to 
hit remote storage
+ * in order to fetch file meta before read files.
+ *
+ * We should implement the corresponding `FileMetaKey` for a specific file 
format, for example
+ * `ParquetFileMetaKey` or `OrcFileMetaKey`. By default, the file path is used 
as the identification
+ * of the `FileMetaKey` and the `getFileMeta` method of `FileMetaKey` is used 
to return the file
+ * meta of the corresponding file format.
+ */
+object FileMetaCacheManager extends Logging {
+
+  private lazy val cacheLoader = new CacheLoader[FileMetaKey, FileMeta]() {

Review comment:
   59d5bb9 changes this to use Guava cache and updates the benchmark results.
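
For readers unfamiliar with the pattern discussed in this thread, a loading cache keyed by file path whose entries expire a fixed time after their last access, here is a minimal single-threaded Python sketch. It is an illustration of the idea only, not the Caffeine or Guava API and not the code in this PR; `ExpireAfterAccessCache`, `ttl_since_last_access_sec`, and the injectable `clock` are names assumed for this example.

```python
import time
from typing import Callable, Dict, Generic, Hashable, Tuple, TypeVar

K = TypeVar("K", bound=Hashable)
V = TypeVar("V")


class ExpireAfterAccessCache(Generic[K, V]):
    """Loading cache whose entries expire `ttl` seconds after last access.

    Not thread-safe, and stale entries are only replaced when they are
    requested again; a real cache would also evict them in the background.
    """

    def __init__(
        self,
        loader: Callable[[K], V],
        ttl_since_last_access_sec: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader
        self._ttl = ttl_since_last_access_sec
        self._clock = clock
        self._entries: Dict[K, Tuple[V, float]] = {}
        self.hits = 0
        self.misses = 0

    def get(self, key: K) -> V:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[1] < self._ttl:
            self.hits += 1
            value = entry[0]
        else:
            # Missing or expired: call the loader (e.g. read the file footer).
            self.misses += 1
            value = self._loader(key)
        # Every access, hit or miss, refreshes the last-access timestamp.
        self._entries[key] = (value, now)
        return value
```

Passing the clock as a parameter keeps the expiry logic testable without sleeping; the hit and miss counters play the role that cache statistics play in a benchmark.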







[GitHub] [spark] cloud-fan commented on pull request #33664: [SPARK-36444][SQL] Remove OptimizeSubqueries from batch of PartitionPruning

2021-08-18 Thread GitBox


cloud-fan commented on pull request #33664:
URL: https://github.com/apache/spark/pull/33664#issuecomment-901589931


   
https://github.com/apache/spark/commit/a7a3935c97d1fe6060cae42bbc9229c087b648ab#diff-5221c65a64ad82c34cae68169cdb389210a9a28145058ae995b46ff4d3d4964cR39
   
   We put this `OptimizeSubqueries` rule together with the DPP rule at the very 
beginning. It's kind of a mistake, as once this rule applies, we break plan 
reuse and thus break DPP.
   
   This PR LGTM





[GitHub] [spark] sumeetgajjar edited a comment on pull request #33782: [SPARK-35011][CORE][3.0] Avoid Block Manager registrations when StopExecutor msg is in-flight

2021-08-18 Thread GitBox


sumeetgajjar edited a comment on pull request #33782:
URL: https://github.com/apache/spark/pull/33782#issuecomment-901588979


   The GitHub check failed due to an unrelated UT failure.
   ```
   [info] - multiple joins *** FAILED *** (1 second, 104 milliseconds)
   [info]   ArrayBuffer(BroadcastHashJoin [b#147460], [a#147469], Inner, BuildLeft
   ```
   I do not have permission to re-run the checks; could someone please re-run 
them?
   
   Edit: I ran the same UT locally, and it passed without any issues. :)





[GitHub] [spark] sumeetgajjar commented on pull request #33782: [SPARK-35011][CORE][3.0] Avoid Block Manager registrations when StopExecutor msg is in-flight

2021-08-18 Thread GitBox


sumeetgajjar commented on pull request #33782:
URL: https://github.com/apache/spark/pull/33782#issuecomment-901588979


   The GitHub check failed due to an unrelated UT failure.
   ```
   [info] - multiple joins *** FAILED *** (1 second, 104 milliseconds)
   [info]   ArrayBuffer(BroadcastHashJoin [b#147460], [a#147469], Inner, BuildLeft
   ```
   I do not have permission to re-run the checks; could someone please re-run 
them?





[GitHub] [spark] dgd-contributor commented on a change in pull request #33752: [SPARK-36401][PYTHON] Implement Series.cov

2021-08-18 Thread GitBox


dgd-contributor commented on a change in pull request #33752:
URL: https://github.com/apache/spark/pull/33752#discussion_r691758520



##
File path: python/pyspark/pandas/series.py
##
@@ -944,6 +944,57 @@ def between(self, left: Any, right: Any, inclusive: bool = 
True) -> "Series":
 
 return lmask & rmask
 
+def cov(self, other: "Series", min_periods: int = 1) -> float:
+"""
+Compute covariance with Series, excluding missing values.
+Parameters
+----------
+other : Series
+Series with which to compute the covariance.
+min_periods : int, default 1
+Minimum number of observations needed to have a valid result. None = 1.
+
+Returns
+-------
+float
+Covariance between Series and other
+
+Examples
+--------
+>>> from pyspark.pandas.config import set_option, reset_option
+>>> set_option("compute.ops_on_diff_frames", True)
+>>> s1 = ps.Series([0.90010907, 0.13484424, 0.62036035])
+>>> s2 = ps.Series([0.12528585, 0.26962463, 0.5198])
+>>> s1.cov(s2)
+-0.016857626527158744
+>>> reset_option("compute.ops_on_diff_frames")
+"""
+
+if min_periods is None:
+min_periods = 1
+
+if same_anchor(self, other):
+self_column_label = verify_temp_column_name(other.to_frame(), 
"__self_column__")
+other_column_label = verify_temp_column_name(self.to_frame(), 
"__other_column__")
+combined = DataFrame(
+self._internal.with_new_columns(
+[self.rename(self_column_label), 
other.rename(other_column_label)]
+)
+)

Review comment:
   I think `sdf.count()` may be better than `len(sdf.head(min_periods))` 
because it does not collect data to the driver.
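
For context on what the `min_periods` check guards, the semantics described in the docstring under review (drop pairs with a missing value, require at least `min_periods` remaining observations, then compute the sample covariance) can be sketched in plain Python. This shows the arithmetic only and is not the pandas-on-Spark implementation; `cov_with_min_periods` is a hypothetical name, and returning NaN when fewer than two valid pairs remain is an assumption of this sketch.

```python
import math
from typing import Optional, Sequence


def cov_with_min_periods(
    xs: Sequence[Optional[float]],
    ys: Sequence[Optional[float]],
    min_periods: int = 1,
) -> float:
    # Keep only positions where both values are present (not None, not NaN).
    pairs = [
        (x, y)
        for x, y in zip(xs, ys)
        if x is not None and y is not None
        and not math.isnan(x) and not math.isnan(y)
    ]
    n = len(pairs)
    # Only the number of valid pairs matters for this check.
    # The sample covariance (n - 1 denominator) also needs at least 2 pairs.
    if n < max(min_periods, 2):
        return math.nan
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in pairs) / (n - 1)
```

Note that the threshold decision depends only on a count of valid observations, which is why the choice between counting rows and fetching the first `min_periods` rows is a performance question rather than a correctness one.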







[GitHub] [spark] AmplabJenkins removed a comment on pull request #33673: [SPARK-36448][SQL] Exceptions in NoSuchItemException.scala have to be case classes

2021-08-18 Thread GitBox


AmplabJenkins removed a comment on pull request #33673:
URL: https://github.com/apache/spark/pull/33673#issuecomment-900866537


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/142587/
   





[GitHub] [spark] cloud-fan commented on a change in pull request #33673: [SPARK-36448][SQL] Exceptions in NoSuchItemException.scala have to be case classes

2021-08-18 Thread GitBox


cloud-fan commented on a change in pull request #33673:
URL: https://github.com/apache/spark/pull/33673#discussion_r691756821



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/NoSuchItemException.scala
##
@@ -29,18 +29,24 @@ import org.apache.spark.sql.types.StructType
  * Thrown by a catalog when an item cannot be found. The analyzer will rethrow 
the exception
  * as an [[org.apache.spark.sql.AnalysisException]] with the correct position 
information.
  */
-class NoSuchDatabaseException(
-val db: String) extends NoSuchNamespaceException(s"Database '$db' not 
found")
+case class NoSuchDatabaseException(db: String)
+  extends AnalysisException(s"Database '$db' not found")

Review comment:
   I can't think of a better way. AFAIK it's a bad pattern to extend a 
case class in Scala.







[GitHub] [spark] cloud-fan closed pull request #33736: [SPARK-35991][SQL] Add PlanStability suite for TPCH

2021-08-18 Thread GitBox


cloud-fan closed pull request #33736:
URL: https://github.com/apache/spark/pull/33736


   





[GitHub] [spark] cloud-fan commented on pull request #33736: [SPARK-35991][SQL] Add PlanStability suite for TPCH

2021-08-18 Thread GitBox


cloud-fan commented on pull request #33736:
URL: https://github.com/apache/spark/pull/33736#issuecomment-901584521


   thanks, merging to master!





[GitHub] [spark] AmplabJenkins removed a comment on pull request #33599: [SPARK-36371][SQL] Support raw string literal

2021-08-18 Thread GitBox


AmplabJenkins removed a comment on pull request #33599:
URL: https://github.com/apache/spark/pull/33599#issuecomment-901570913


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47131/
   





[GitHub] [spark] cloud-fan closed pull request #33599: [SPARK-36371][SQL] Support raw string literal

2021-08-18 Thread GitBox


cloud-fan closed pull request #33599:
URL: https://github.com/apache/spark/pull/33599


   





[GitHub] [spark] cloud-fan commented on pull request #33599: [SPARK-36371][SQL] Support raw string literal

2021-08-18 Thread GitBox


cloud-fan commented on pull request #33599:
URL: https://github.com/apache/spark/pull/33599#issuecomment-901583747


   thanks, merging to master!





[GitHub] [spark] SparkQA commented on pull request #33664: [SPARK-36444][SQL] Remove OptimizeSubqueries from batch of PartitionPruning

2021-08-18 Thread GitBox


SparkQA commented on pull request #33664:
URL: https://github.com/apache/spark/pull/33664#issuecomment-901582918


   **[Test build #142633 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142633/testReport)**
 for PR 33664 at commit 
[`0d7e228`](https://github.com/apache/spark/commit/0d7e228b42b92abf1ce15681a2b95361dac4).





[GitHub] [spark] AmplabJenkins commented on pull request #33783: [MINOR][DOCS] Mention Hadoop 3 in YARN introduction on cluster-overview.md

2021-08-18 Thread GitBox


AmplabJenkins commented on pull request #33783:
URL: https://github.com/apache/spark/pull/33783#issuecomment-901582593


   Can one of the admins verify this patch?





[GitHub] [spark] AmplabJenkins removed a comment on pull request #33748: [SPARK-36516][SQL] Support File Metadata Cache for ORC

2021-08-18 Thread GitBox


AmplabJenkins removed a comment on pull request #33748:
URL: https://github.com/apache/spark/pull/33748#issuecomment-901581916


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47132/
   





[GitHub] [spark] AmplabJenkins commented on pull request #33748: [SPARK-36516][SQL] Support File Metadata Cache for ORC

2021-08-18 Thread GitBox


AmplabJenkins commented on pull request #33748:
URL: https://github.com/apache/spark/pull/33748#issuecomment-901581916


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47132/
   





[GitHub] [spark] yutoacts commented on a change in pull request #33777: [SPARK-36538][DOCS] Fix the environment variables part in configuration.md

2021-08-18 Thread GitBox


yutoacts commented on a change in pull request #33777:
URL: https://github.com/apache/spark/pull/33777#discussion_r691749052



##
File path: docs/configuration.md
##
@@ -3075,7 +3075,7 @@ to use on each machine and maximum memory.
 Since `spark-env.sh` is a shell script, some of these can be set 
programmatically -- for example, you might
 compute `SPARK_LOCAL_IP` by looking up the IP of a specific network interface.
 
-Note: When running Spark on YARN in `cluster` mode, environment variables need 
to be set using the `spark.yarn.appMasterEnv.[EnvironmentVariableName]` 
property in your `conf/spark-defaults.conf` file.  Environment variables that 
are set in `spark-env.sh` will not be reflected in the YARN Application Master 
process in `cluster` mode.  See the [YARN-related Spark 
Properties](running-on-yarn.html#spark-properties) for more information.

Review comment:
   I think I totally misunderstood what it says. Thank you for the 
correction.







[GitHub] [spark] yutoacts closed pull request #33777: [SPARK-36538][DOCS] Fix the environment variables part in configuration.md

2021-08-18 Thread GitBox


yutoacts closed pull request #33777:
URL: https://github.com/apache/spark/pull/33777


   





[GitHub] [spark] SparkQA commented on pull request #33748: [SPARK-36516][SQL] Support File Metadata Cache for ORC

2021-08-18 Thread GitBox


SparkQA commented on pull request #33748:
URL: https://github.com/apache/spark/pull/33748#issuecomment-901578737


   Kubernetes integration test unable to build dist.
   
   exiting with code: 1
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47132/
   





[GitHub] [spark] yutoacts commented on pull request #33568: [SPARK-36335][DOCS] Remove Local-cluster mode reference (and add a missing period)

2021-08-18 Thread GitBox


yutoacts commented on pull request #33568:
URL: https://github.com/apache/spark/pull/33568#issuecomment-901576683


   It ended up as https://github.com/apache/spark/pull/33537.





[GitHub] [spark] yutoacts closed pull request #33568: [SPARK-36335][DOCS] Remove Local-cluster mode reference (and add a missing period)

2021-08-18 Thread GitBox


yutoacts closed pull request #33568:
URL: https://github.com/apache/spark/pull/33568


   





[GitHub] [spark] yutoacts opened a new pull request #33783: [MINOR][DOCS] Mention Hadoop 3 in YARN introduction on cluster-overview.md

2021-08-18 Thread GitBox


yutoacts opened a new pull request #33783:
URL: https://github.com/apache/spark/pull/33783


   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   





[GitHub] [spark] dgd-contributor commented on a change in pull request #33752: [SPARK-36401][PYTHON] Implement Series.cov

2021-08-18 Thread GitBox


dgd-contributor commented on a change in pull request #33752:
URL: https://github.com/apache/spark/pull/33752#discussion_r691744731



##
File path: python/pyspark/pandas/series.py
##
@@ -944,6 +944,57 @@ def between(self, left: Any, right: Any, inclusive: bool = True) -> "Series":
 
         return lmask & rmask
 
+    def cov(self, other: "Series", min_periods: int = 1) -> float:
+        """
+        Compute covariance with Series, excluding missing values.
+        Parameters
+        ----------
+        other : Series
+            Series with which to compute the covariance.
+        min_periods : int, default 1
+            Minimum number of observations needed to have a valid result. None = 1.
+
+        Returns
+        -------
+        float
+            Covariance between Series and other
+
+        Examples
+        --------
+        >>> from pyspark.pandas.config import set_option, reset_option
+        >>> set_option("compute.ops_on_diff_frames", True)
+        >>> s1 = ps.Series([0.90010907, 0.13484424, 0.62036035])
+        >>> s2 = ps.Series([0.12528585, 0.26962463, 0.51111198])
+        >>> s1.cov(s2)
+        -0.016857626527158744
+        >>> reset_option("compute.ops_on_diff_frames")
+        """
+
+        if min_periods is None:
+            min_periods = 1
+
+        if same_anchor(self, other):
+            self_column_label = verify_temp_column_name(other.to_frame(), "__self_column__")
+            other_column_label = verify_temp_column_name(self.to_frame(), "__other_column__")
+            combined = DataFrame(
+                self._internal.with_new_columns(
+                    [self.rename(self_column_label), other.rename(other_column_label)]
+                )
+            )

Review comment:
   Thank you so much. Done.
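
For readers following the thread, here is a minimal plain-Python sketch of the behavior the docstring describes: sample covariance over the positions where both values are present, with `min_periods` as a lower bound on the number of valid pairs. It is not the PR's implementation. `sample_cov` is a hypothetical helper, it pairs values by position rather than by index, and the third value of `s2` is assumed to be `0.51111198` because that value reproduces the result quoted in the docstring.

```python
import math
from typing import Optional, Sequence


def sample_cov(
    xs: Sequence[Optional[float]],
    ys: Sequence[Optional[float]],
    min_periods: Optional[int] = 1,
) -> float:
    """Sample covariance (ddof=1) over positions where both values are present."""
    if min_periods is None:
        min_periods = 1

    def present(v: Optional[float]) -> bool:
        return v is not None and not math.isnan(v)

    # Keep only the pairs in which neither side is missing.
    pairs = [(x, y) for x, y in zip(xs, ys) if present(x) and present(y)]
    n = len(pairs)
    # Too few observations, or a single pair (ddof=1 would divide by zero).
    if n < min_periods or n < 2:
        return float("nan")

    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in pairs) / (n - 1)


s1 = [0.90010907, 0.13484424, 0.62036035]
s2 = [0.12528585, 0.26962463, 0.51111198]  # third value is an assumption
result = sample_cov(s1, s2)
```

With these inputs `result` agrees with the docstring value to within floating-point tolerance; raising `min_periods` above the number of valid pairs yields NaN.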







[GitHub] [spark] wangyum commented on pull request #33664: [SPARK-36444][SQL] Remove OptimizeSubqueries from batch of PartitionPruning

2021-08-18 Thread GitBox


wangyum commented on pull request #33664:
URL: https://github.com/apache/spark/pull/33664#issuecomment-901573272


   retest this please.





[GitHub] [spark] cloud-fan closed pull request #33781: [SPARK-33687][SQL][DOC][FOLLOWUP] Merge the doc pages of ANALYZE TABLE and ANALYZE TABLES

2021-08-18 Thread GitBox


cloud-fan closed pull request #33781:
URL: https://github.com/apache/spark/pull/33781


   





[GitHub] [spark] cloud-fan commented on pull request #33781: [SPARK-33687][SQL][DOC][FOLLOWUP] Merge the doc pages of ANALYZE TABLE and ANALYZE TABLES

2021-08-18 Thread GitBox


cloud-fan commented on pull request #33781:
URL: https://github.com/apache/spark/pull/33781#issuecomment-901572333


   thanks for the review, merging to master/3.2!





[GitHub] [spark] cloud-fan commented on a change in pull request #33650: [SPARK-36351][SQL] Refactor filter push down in file source v2

2021-08-18 Thread GitBox


cloud-fan commented on a change in pull request #33650:
URL: https://github.com/apache/spark/pull/33650#discussion_r691740952



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileScanBuilder.scala
##
@@ -57,6 +63,30 @@ abstract class FileScanBuilder(
     StructType(fields)
   }
 
+  def pushFiltersToFileIndex(

Review comment:
   This pushes data filters to the underlying file format as well. How about
   ```
   protected var partitionFilters = Seq.empty[Expression]
   protected var dataFilters = Seq.empty[Expression]
   protected var pushedDataFilters = Seq.empty[Filter]
   ...
   def pushFilters(partitionFilters: Seq[Expression], dataFilters: Seq[Expression]): Unit = {
     this.partitionFilters = partitionFilters
     this.dataFilters = dataFilters
     this.pushedDataFilters = pushDataFilters(dataFilters)
   }
   
   // Explicit return type so that overrides can return a non-empty Seq.
   protected def pushDataFilters(dataFilters: Seq[Expression]): Seq[Filter] = Nil
   ```
   
   Then the file source impl can just override `pushDataFilters`.
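
The suggestion above is a template-method arrangement: the base builder records both filter lists and delegates the "what can actually be pushed" decision to an overridable hook. A rough Python sketch of that shape follows. Every name in it (`FileScanBuilderSketch`, `ColumnarScanBuilder`, the tuple-based filter representation, `SUPPORTED_OPS`) is hypothetical and only illustrates the control flow; it is not Spark's API.

```python
from typing import List, Sequence, Tuple

# A filter is modeled as (column, operator, value); this is an assumption
# made for the sketch, not how Spark represents expressions.
Filter = Tuple[str, str, object]


class FileScanBuilderSketch:
    """Hypothetical stand-in for the base builder discussed in the review."""

    def __init__(self) -> None:
        self.partition_filters: List[Filter] = []
        self.data_filters: List[Filter] = []
        self.pushed_data_filters: List[Filter] = []

    def push_filters(
        self, partition_filters: Sequence[Filter], data_filters: Sequence[Filter]
    ) -> None:
        # Record everything, then let the format-specific hook decide
        # which data filters are really pushed down.
        self.partition_filters = list(partition_filters)
        self.data_filters = list(data_filters)
        self.pushed_data_filters = self.push_data_filters(self.data_filters)

    def push_data_filters(self, data_filters: Sequence[Filter]) -> List[Filter]:
        # Default: the format pushes nothing.
        return []


class ColumnarScanBuilder(FileScanBuilderSketch):
    """Hypothetical format that can only push simple comparisons."""

    SUPPORTED_OPS = {"=", "<", ">"}

    def push_data_filters(self, data_filters: Sequence[Filter]) -> List[Filter]:
        return [f for f in data_filters if f[1] in self.SUPPORTED_OPS]
```

The point of the design is that a file source overrides one small method while the bookkeeping stays in a single place.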







[GitHub] [spark] cloud-fan commented on a change in pull request #33650: [SPARK-36351][SQL] Refactor filter push down in file source v2

2021-08-18 Thread GitBox


cloud-fan commented on a change in pull request #33650:
URL: https://github.com/apache/spark/pull/33650#discussion_r691740952



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileScanBuilder.scala
##
@@ -57,6 +63,30 @@ abstract class FileScanBuilder(
     StructType(fields)
   }
 
+  def pushFiltersToFileIndex(

Review comment:
   This pushes data filters to the underlying file format as well. How about
   ```
   protected var partitionFilters = Seq.empty[Expression]
   protected var dataFilters = Seq.empty[Expression]
   protected var pushedDataFilters = Seq.empty[Expression]
   ...
   def pushFilters(partitionFilters: Seq[Expression], dataFilters: Seq[Expression]): Unit = {
     this.partitionFilters = partitionFilters
     this.dataFilters = dataFilters
     this.pushedDataFilters = pushDataFilters(dataFilters)
   }
   
   // Explicit return type so that overrides can return a non-empty Seq.
   protected def pushDataFilters(dataFilters: Seq[Expression]): Seq[Expression] = Nil
   ```
   
   Then the file source impl can just override `pushDataFilters`.







[GitHub] [spark] LuciferYang commented on a change in pull request #33748: [SPARK-36516][SQL] Support File Metadata Cache for ORC

2021-08-18 Thread GitBox


LuciferYang commented on a change in pull request #33748:
URL: https://github.com/apache/spark/pull/33748#discussion_r691734091



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileMetaCacheManager.scala
##
@@ -0,0 +1,95 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import java.util.concurrent.TimeUnit
+
+import com.github.benmanes.caffeine.cache.{CacheLoader, Caffeine}
+import com.github.benmanes.caffeine.cache.stats.CacheStats
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkEnv
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.internal.SQLConf
+
+/**
+ * A singleton Cache Manager to caching file meta. We cache these file metas in order to speed up
+ * iterated queries over the same dataset. Otherwise, each query would have to hit remote storage
+ * in order to fetch file meta before read files.
+ *
+ * We should implement the corresponding `FileMetaKey` for a specific file format, for example
+ * `ParquetFileMetaKey` or `OrcFileMetaKey`. By default, the file path is used as the identification
+ * of the `FileMetaKey` and the `getFileMeta` method of `FileMetaKey` is used to return the file
+ * meta of the corresponding file format.
+object FileMetaCacheManager extends Logging {
+
+  private lazy val cacheLoader = new CacheLoader[FileMetaKey, FileMeta]() {

Review comment:
   @dongjoon-hyun I will change this to use Guava because SPARK-34309 will be 
reverted; I need to update the benchmark results.
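
For context, the class comment in the quoted diff describes a loading cache: a lookup by file key either returns the cached metadata or calls a loader once and remembers the result until the entry expires. The sketch below shows that pattern in plain Python under several assumptions. `LoadingCacheSketch`, its expire-after-write policy, and the injectable `clock` are all hypothetical; it is neither Caffeine nor Guava, it is not thread-safe, and it has no size bound.

```python
import time
from typing import Callable, Dict, Generic, Hashable, Tuple, TypeVar

K = TypeVar("K", bound=Hashable)
V = TypeVar("V")


class LoadingCacheSketch(Generic[K, V]):
    """Tiny expire-after-write loading cache (illustrative only)."""

    def __init__(
        self,
        loader: Callable[[K], V],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        # key -> (time the entry was written, cached value)
        self._entries: Dict[K, Tuple[float, V]] = {}
        self.hits = 0
        self.misses = 0

    def get(self, key: K) -> V:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            self.hits += 1
            return entry[1]
        # Missing or expired: load once and remember the result.
        self.misses += 1
        value = self._loader(key)
        self._entries[key] = (now, value)
        return value
```

Until an entry expires the loader is not consulted again, which is why the configuration text elsewhere in this thread warns against replacing a data file under the same name while the cache is enabled.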
   
   







[GitHub] [spark] AmplabJenkins commented on pull request #33599: [SPARK-36371][SQL] Support raw string literal

2021-08-18 Thread GitBox


AmplabJenkins commented on pull request #33599:
URL: https://github.com/apache/spark/pull/33599#issuecomment-901570913


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47131/
   





[GitHub] [spark] SparkQA commented on pull request #33599: [SPARK-36371][SQL] Support raw string literal

2021-08-18 Thread GitBox


SparkQA commented on pull request #33599:
URL: https://github.com/apache/spark/pull/33599#issuecomment-901570893


   Kubernetes integration test status success
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47131/
   





[GitHub] [spark] LuciferYang commented on pull request #33629: [SPARK-36407][CORE][SQL] Convert int to long to avoid potential integer multiplications overflow risk

2021-08-18 Thread GitBox


LuciferYang commented on pull request #33629:
URL: https://github.com/apache/spark/pull/33629#issuecomment-901570511


   Thanks, @srowen.





[GitHub] [spark] dgd-contributor closed pull request #33779: [SPARK-36302][SQL]: Refactor thirteenth set of 20 query execution errors to use error classes

2021-08-18 Thread GitBox


dgd-contributor closed pull request #33779:
URL: https://github.com/apache/spark/pull/33779


   





[GitHub] [spark] xinrong-databricks commented on a change in pull request #33714: [SPARK-36399][PYTHON] Implement DataFrame.combine_first

2021-08-18 Thread GitBox


xinrong-databricks commented on a change in pull request #33714:
URL: https://github.com/apache/spark/pull/33714#discussion_r691736383



##
File path: python/pyspark/pandas/tests/test_dataframe.py
##
@@ -5614,6 +5614,40 @@ def test_at_time(self):
         with self.assertRaisesRegex(TypeError, "Index must be DatetimeIndex"):
             psdf.at_time("0:15")
 
+    def test_combine_first(self):

Review comment:
   Let me take a look then, thanks!







[GitHub] [spark] SparkQA commented on pull request #33748: [SPARK-36516][SQL] Support File Metadata Cache for ORC

2021-08-18 Thread GitBox


SparkQA commented on pull request #33748:
URL: https://github.com/apache/spark/pull/33748#issuecomment-901564736


   **[Test build #142632 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142632/testReport)** for PR 33748 at commit [`c3838e6`](https://github.com/apache/spark/commit/c3838e68241d5f8409cbcc565815a494e7eb245b).





[GitHub] [spark] LuciferYang commented on a change in pull request #33748: [SPARK-36516][SQL] Support File Metadata Cache for ORC

2021-08-18 Thread GitBox


LuciferYang commented on a change in pull request #33748:
URL: https://github.com/apache/spark/pull/33748#discussion_r691734091



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileMetaCacheManager.scala
##
@@ -0,0 +1,95 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import java.util.concurrent.TimeUnit
+
+import com.github.benmanes.caffeine.cache.{CacheLoader, Caffeine}
+import com.github.benmanes.caffeine.cache.stats.CacheStats
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkEnv
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.internal.SQLConf
+
+/**
+ * A singleton Cache Manager to caching file meta. We cache these file metas in order to speed up
+ * iterated queries over the same dataset. Otherwise, each query would have to hit remote storage
+ * in order to fetch file meta before read files.
+ *
+ * We should implement the corresponding `FileMetaKey` for a specific file format, for example
+ * `ParquetFileMetaKey` or `OrcFileMetaKey`. By default, the file path is used as the identification
+ * of the `FileMetaKey` and the `getFileMeta` method of `FileMetaKey` is used to return the file
+ * meta of the corresponding file format.
+object FileMetaCacheManager extends Logging {
+
+  private lazy val cacheLoader = new CacheLoader[FileMetaKey, FileMeta]() {

Review comment:
   @dongjoon-hyun I will revert to using Guava because SPARK-34309 will be 
reverted; I need to update the benchmark results.
   
   







[GitHub] [spark] huaxingao commented on a change in pull request #33650: [SPARK-36351][SQL] Refactor filter push down in file source v2

2021-08-18 Thread GitBox


huaxingao commented on a change in pull request #33650:
URL: https://github.com/apache/spark/pull/33650#discussion_r691733542



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/PushDownUtils.scala
##
@@ -38,9 +37,9 @@ object PushDownUtils extends PredicateHelper {
    * @return pushed filter and post-scan filters.
    */
   def pushFilters(
-  scanBuilder: ScanBuilder,
+  scanBuilderHolder: ScanBuilderHolder,

Review comment:
   because I need the `scanBuilderHolder.relation` for 
`DataSourceUtils.getPartitionKeyFiltersAndDataFilters`

##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/PushDownUtils.scala
##
@@ -50,8 +49,17 @@ object PushDownUtils extends PredicateHelper {
         val translatedFilters = mutable.ArrayBuffer.empty[sources.Filter]
         // Catalyst filter expression that can't be translated to data source filters.
         val untranslatableExprs = mutable.ArrayBuffer.empty[Expression]
+        val dataFilters = r match {
+          case f: FileScanBuilder =>
+            val (partitionFilters, fileDataFilters) =
+              DataSourceUtils.getPartitionKeyFiltersAndDataFilters(
+                  f.getSparkSession, scanBuilderHolder.relation, f.readPartitionSchema(), filters)
+            f.pushPartitionFilters(ExpressionSet(partitionFilters).toSeq, fileDataFilters)

Review comment:
   As per our offline discussion, I have made the following changes:
   - make file source v2 NOT implement `SupportsPushDownFilters` any more
   - add `pushFiltersToFileIndex` in file source v2. In this method:
     - both the `Expression` partition filters and the `Expression` data
       filters are pushed to the file source.
     - data filters are used for filter push down. The file source translates
       the data filters from `Expression` to `sources.Filter` and decides
       which filters to push down.
     - partition filters are used for partition pruning.
   
   I have updated the PR description accordingly. 
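
To make the partition-filter versus data-filter distinction concrete, here is a deliberately simplified Python sketch: a predicate that references only partition columns can prune partitions, and everything else is treated as a data filter. `Predicate` and `split_filters` are hypothetical names. The real `DataSourceUtils.getPartitionKeyFiltersAndDataFilters` works on Catalyst expressions and may do more than this, so read the sketch as an assumption about the general idea rather than a description of Spark's code.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Tuple


@dataclass(frozen=True)
class Predicate:
    """Hypothetical predicate: display text plus the columns it references."""
    text: str
    columns: FrozenSet[str]


def split_filters(
    filters: Iterable[Predicate], partition_columns: Iterable[str]
) -> Tuple[List[Predicate], List[Predicate]]:
    """Return (partition_filters, data_filters)."""
    partition_cols = frozenset(partition_columns)
    partition_filters: List[Predicate] = []
    data_filters: List[Predicate] = []
    for f in filters:
        # Only predicates whose references are all partition columns
        # can be evaluated without reading the data files.
        if f.columns <= partition_cols:
            partition_filters.append(f)
        else:
            data_filters.append(f)
    return partition_filters, data_filters
```

In this sketch a predicate that mixes a partition column with a data column lands in the data filters, since it cannot be decided from partition values alone.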







[GitHub] [spark] LuciferYang commented on a change in pull request #33748: [SPARK-36516][SQL] Support File Metadata Cache for ORC

2021-08-18 Thread GitBox


LuciferYang commented on a change in pull request #33748:
URL: https://github.com/apache/spark/pull/33748#discussion_r691733053



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##
@@ -967,6 +967,32 @@ object SQLConf {
     .booleanConf
     .createWithDefault(false)
 
+  val FILE_META_CACHE_ENABLED_SOURCE_LIST = buildConf("spark.sql.fileMetaCache.enabledSourceList")
+    .doc("A comma-separated list of data source short names for which data source enabled file " +
+      "meta cache, now the file meta cache only support ORC, it is recommended to enabled this " +
+      "config when multiple queries are performed on the same dataset, default is false." +
+      "Warning: if the fileMetaCache is enabled, the existing data files should not be " +
+      "replaced with the same file name, otherwise there will be a risk of job failure or wrong " +
+      "data reading before the cache entry expires.")
+    .version("3.3.0")
+    .stringConf

Review comment:
   c3838e6 adds `.checkValue` and a test case.
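
The check added in that commit is not shown in this thread. As a rough illustration of what validating a comma-separated source list can look like, here is a Python sketch. `parse_enabled_source_list` and `SUPPORTED_SOURCES` are hypothetical, and the choice to lowercase names and ignore empty items is an assumption, not something the PR states.

```python
from typing import FrozenSet, List

# Assumption: per the config doc above, only ORC is supported for now.
SUPPORTED_SOURCES: FrozenSet[str] = frozenset({"orc"})


def parse_enabled_source_list(
    raw: str, supported: FrozenSet[str] = SUPPORTED_SOURCES
) -> List[str]:
    """Split a comma-separated list and reject unsupported source names."""
    names = [part.strip().lower() for part in raw.split(",") if part.strip()]
    unsupported = [name for name in names if name not in supported]
    if unsupported:
        raise ValueError(
            "Unsupported data source(s) for file meta cache: "
            + ", ".join(unsupported)
        )
    return names
```

Rejecting bad values when the config is set keeps a typo from silently disabling the cache.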







[GitHub] [spark] HeartSaVioR edited a comment on pull request #33749: [SPARK-36519][SS]Store RocksDB format version in the checkpoint for streaming queries

2021-08-18 Thread GitBox


HeartSaVioR edited a comment on pull request #33749:
URL: https://github.com/apache/spark/pull/33749#issuecomment-901562572









[GitHub] [spark] HeartSaVioR commented on pull request #33749: [SPARK-36519][SS]Store RocksDB format version in the checkpoint for streaming queries

2021-08-18 Thread GitBox


HeartSaVioR commented on pull request #33749:
URL: https://github.com/apache/spark/pull/33749#issuecomment-901562572


   I'll merge this early tomorrow if there's no further comment, or @viirya is 
OK with this. cc. @viirya 





[GitHub] [spark] AmplabJenkins removed a comment on pull request #33588: [SPARK-36346][SQL] Support TimestampNTZ type in Orc file source

2021-08-18 Thread GitBox


AmplabJenkins removed a comment on pull request #33588:
URL: https://github.com/apache/spark/pull/33588#issuecomment-901136096


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/142588/
   





[GitHub] [spark] beliefer removed a comment on pull request #33588: [SPARK-36346][SQL] Support TimestampNTZ type in Orc file source

2021-08-18 Thread GitBox


beliefer removed a comment on pull request #33588:
URL: https://github.com/apache/spark/pull/33588#issuecomment-900137467


   ping @cloud-fan 





[GitHub] [spark] beliefer edited a comment on pull request #33588: [SPARK-36346][SQL] Support TimestampNTZ type in Orc file source

2021-08-18 Thread GitBox


beliefer edited a comment on pull request #33588:
URL: https://github.com/apache/spark/pull/33588#issuecomment-901186384


   ping @gengliangwang @cloud-fan 





[GitHub] [spark] AmplabJenkins removed a comment on pull request #33714: [SPARK-36399][PYTHON] Implement DataFrame.combine_first

2021-08-18 Thread GitBox


AmplabJenkins removed a comment on pull request #33714:
URL: https://github.com/apache/spark/pull/33714#issuecomment-901561191


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47129/
   





[GitHub] [spark] AmplabJenkins commented on pull request #33782: [SPARK-35011][CORE][3.0] Avoid Block Manager registrations when StopExecutor msg is in-flight

2021-08-18 Thread GitBox


AmplabJenkins commented on pull request #33782:
URL: https://github.com/apache/spark/pull/33782#issuecomment-901561551


   Can one of the admins verify this patch?





[GitHub] [spark] AmplabJenkins commented on pull request #33714: [SPARK-36399][PYTHON] Implement DataFrame.combine_first

2021-08-18 Thread GitBox


AmplabJenkins commented on pull request #33714:
URL: https://github.com/apache/spark/pull/33714#issuecomment-901561191


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47129/
   





[GitHub] [spark] sumeetgajjar commented on pull request #33770: [SPARK-34949][CORE][3.0] Prevent BlockManager reregister when Executor is shutting down

2021-08-18 Thread GitBox


sumeetgajjar commented on pull request #33770:
URL: https://github.com/apache/spark/pull/33770#issuecomment-901560073


   Thank you @dongjoon-hyun and @holdenk for taking a look at this PR.





[GitHub] [spark] sumeetgajjar commented on pull request #33771: [SPARK-35011][CORE][3.1] Avoid Block Manager registrations when StopExecutor msg is in-flight

2021-08-18 Thread GitBox


sumeetgajjar commented on pull request #33771:
URL: https://github.com/apache/spark/pull/33771#issuecomment-901559674


   Thank you @dongjoon-hyun and @zhuqi-lucas for approving this PR.





[GitHub] [spark] sumeetgajjar commented on pull request #33782: [SPARK-35011][CORE][3.0] Avoid Block Manager registrations when StopExecutor msg is in-flight

2021-08-18 Thread GitBox


sumeetgajjar commented on pull request #33782:
URL: https://github.com/apache/spark/pull/33782#issuecomment-901559419


   @dongjoon-hyun @mridulm @Ngone51 
   Could you please take a look at this backport PR?





[GitHub] [spark] sumeetgajjar opened a new pull request #33782: [SPARK-35011][CORE][3.0] Avoid Block Manager registrations when StopExecutor msg is in-flight

2021-08-18 Thread GitBox


sumeetgajjar opened a new pull request #33782:
URL: https://github.com/apache/spark/pull/33782


   This PR backports #32114 to 3.0
   
   
   
   
   ### What changes were proposed in this pull request?
   
   This patch proposes a fix to prevent triggering BlockManager reregistration 
while a `StopExecutor` message is in flight.
   On receiving a `StopExecutor` message, we do not remove the corresponding 
`BlockManagerInfo` from the `blockManagerInfo` map. Instead, we mark it as dead 
by updating the corresponding `executorRemovalTs`. A separate cleanup thread 
periodically removes stale `BlockManagerInfo` entries from the 
`blockManagerInfo` map.
   
   Now if a recently removed `BlockManager` tries to register, the driver 
simply ignores it, since the `blockManagerInfo` map already contains an entry 
for it. The same applies to `BlockManagerHeartbeat`: if the BlockManager 
belongs to a recently removed executor, the `blockManagerInfo` map still 
contains an entry, so we do not ask the corresponding `BlockManager` to 
re-register.
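
A minimal sketch of the bookkeeping described above, written as a toy Python model rather than Spark's actual Scala code. The class name `ToyBlockManagerMaster`, its method names, and the `cleanup_after_s` parameter are hypothetical and only illustrate the "mark as dead, clean up later" idea:

```python
import time


class ToyBlockManagerMaster:
    """Toy model of the driver-side map handling (assumption: not Spark code)."""

    def __init__(self, cleanup_after_s=60.0, clock=time.monotonic):
        self.block_manager_info = {}   # executor_id -> info for that BlockManager
        self.executor_removal_ts = {}  # executor_id -> time it was marked dead
        self.cleanup_after_s = cleanup_after_s
        self.clock = clock

    def register(self, executor_id):
        # A recently removed executor still has an entry, so the request is ignored.
        if executor_id in self.block_manager_info:
            return False
        self.block_manager_info[executor_id] = {"blocks": {}}
        return True

    def stop_executor(self, executor_id):
        # Keep the entry and only record when the executor was removed.
        if executor_id in self.block_manager_info:
            self.executor_removal_ts[executor_id] = self.clock()

    def heartbeat_needs_reregister(self, executor_id):
        # Ask for re-registration only when no entry exists at all.
        return executor_id not in self.block_manager_info

    def cleanup(self):
        # Periodic task: drop entries that have been dead for long enough.
        now = self.clock()
        for executor_id, ts in list(self.executor_removal_ts.items()):
            if now - ts >= self.cleanup_after_s:
                del self.executor_removal_ts[executor_id]
                self.block_manager_info.pop(executor_id, None)
```

In this model, a heartbeat or registration that arrives between `stop_executor` and the next effective `cleanup` finds the entry still present and is therefore ignored.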
   
   
   ### Why are the changes needed?
   
   These changes are needed because BlockManager reregistration while an 
executor is shutting down causes inconsistent bookkeeping of executors in Spark.
   Consider the following scenario:
   - `CoarseGrainedSchedulerBackend` issues async `StopExecutor` on 
executorEndpoint
   - `CoarseGrainedSchedulerBackend` removes that executor from Driver's 
internal data structures and publishes `SparkListenerExecutorRemoved` on the 
`listenerBus`.
   - Executor has still not processed `StopExecutor` from the Driver
   - Driver receives a heartbeat from the Executor. Since it cannot find the 
`executorId` in its data structures, it responds with 
`HeartbeatResponse(reregisterBlockManager = true)`
   - `BlockManager` on the Executor reregisters with the `BlockManagerMaster` 
and `SparkListenerBlockManagerAdded` is published on the `listenerBus`
   - Executor starts processing the `StopExecutor` and exits
   - `AppStatusListener` picks up the `SparkListenerBlockManagerAdded` event and 
updates `AppStatusStore`
   - `statusTracker.getExecutorInfos` consults `AppStatusStore` to get the list 
of executors, which returns the dead executor as alive.
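
The scenario above can be replayed with a small, heavily simplified Python simulation. This is a sketch under assumptions: the function `replay_scenario`, the sets `known` and `status_store`, and the single executor id `exec-1` are hypothetical stand-ins, and the real driver components are collapsed into one function.

```python
def replay_scenario(mark_dead_instead_of_remove):
    """Replay the race for one executor; return (events, executors reported alive)."""
    known = {"exec-1"}         # toy stand-in for the driver's blockManagerInfo map
    status_store = {"exec-1"}  # toy stand-in for what AppStatusStore reports
    events = []

    # Driver issues an async StopExecutor and removes the executor from its view.
    events.append("SparkListenerExecutorRemoved")
    status_store.discard("exec-1")
    if not mark_dead_instead_of_remove:
        known.discard("exec-1")

    # A heartbeat arrives before the executor has processed StopExecutor.
    reregister = "exec-1" not in known
    if reregister:
        known.add("exec-1")
        events.append("SparkListenerBlockManagerAdded")
        status_store.add("exec-1")

    # The executor now processes StopExecutor and exits; no further event is sent.
    return events, status_store
```

With `mark_dead_instead_of_remove=False` the dead executor ends up in `status_store`; with `True` the heartbeat finds the entry and no re-registration is requested.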
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   No
   
   ### How was this patch tested?
   
   
   - Modified the existing unit tests.
   - Ran a simple test application on minikube that asserts that the number of 
executors is zero once the executor idle timeout is reached.





[GitHub] [spark] SparkQA commented on pull request #33599: [SPARK-36371][SQL] Support raw string literal

2021-08-18 Thread GitBox


SparkQA commented on pull request #33599:
URL: https://github.com/apache/spark/pull/33599#issuecomment-901558167


   Kubernetes integration test starting
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47131/
   





[GitHub] [spark] AngersZhuuuu commented on pull request #30057: [SPARK-32838][SQL]Check DataSource insert command path with actual path

2021-08-18 Thread GitBox


AngersZhuuuu commented on pull request #30057:
URL: https://github.com/apache/spark/pull/30057#issuecomment-901556045


   gentle ping @cloud-fan @viirya 





[GitHub] [spark] SparkQA commented on pull request #33714: [SPARK-36399][PYTHON] Implement DataFrame.combine_first

2021-08-18 Thread GitBox


SparkQA commented on pull request #33714:
URL: https://github.com/apache/spark/pull/33714#issuecomment-901555483


   Kubernetes integration test status failure
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47129/
   





[GitHub] [spark] SparkQA commented on pull request #33599: [SPARK-36371][SQL] Support raw string literal

2021-08-18 Thread GitBox


SparkQA commented on pull request #33599:
URL: https://github.com/apache/spark/pull/33599#issuecomment-901541894


   **[Test build #142631 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142631/testReport)**
 for PR 33599 at commit 
[`ec963ef`](https://github.com/apache/spark/commit/ec963efa51cd02cf6816d4eebcf645c709e43f09).





[GitHub] [spark] AmplabJenkins removed a comment on pull request #33599: [SPARK-36371][SQL] Support raw string literal

2021-08-18 Thread GitBox


AmplabJenkins removed a comment on pull request #33599:
URL: https://github.com/apache/spark/pull/33599#issuecomment-901276573









[GitHub] [spark] AmplabJenkins removed a comment on pull request #33650: [SPARK-36351][SQL] Refactor filter push down in file source v2

2021-08-18 Thread GitBox


AmplabJenkins removed a comment on pull request #33650:
URL: https://github.com/apache/spark/pull/33650#issuecomment-901539496









[GitHub] [spark] AmplabJenkins removed a comment on pull request #33714: [SPARK-36399][PYTHON] Implement DataFrame.combine_first

2021-08-18 Thread GitBox


AmplabJenkins removed a comment on pull request #33714:
URL: https://github.com/apache/spark/pull/33714#issuecomment-901539497









[GitHub] [spark] sarutak commented on pull request #33599: [SPARK-36371][SQL] Support raw string literal

2021-08-18 Thread GitBox


sarutak commented on pull request #33599:
URL: https://github.com/apache/spark/pull/33599#issuecomment-901539804


   retest this please.





[GitHub] [spark] HyukjinKwon closed pull request #33332: [SPARK-36147][SQL] Warn if less files visible after stats write in BasicWriteStatsTracker

2021-08-18 Thread GitBox


HyukjinKwon closed pull request #33332:
URL: https://github.com/apache/spark/pull/33332


   





[GitHub] [spark] HyukjinKwon commented on pull request #33332: [SPARK-36147][SQL] Warn if less files visible after stats write in BasicWriteStatsTracker

2021-08-18 Thread GitBox


HyukjinKwon commented on pull request #33332:
URL: https://github.com/apache/spark/pull/33332#issuecomment-901539515


   Merged to master.




