[GitHub] [spark] SparkQA commented on pull request #33748: [SPARK-36516][SQL] Support File Metadata Cache for ORC
SparkQA commented on pull request #33748: URL: https://github.com/apache/spark/pull/33748#issuecomment-901630384 **[Test build #142639 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142639/testReport)** for PR 33748 at commit [`e5f9497`](https://github.com/apache/spark/commit/e5f94971a4af723440e4cac4fa4bfbe6fff018b9). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [spark] gengliangwang commented on pull request #33749: [SPARK-36519][SS]Store RocksDB format version in the checkpoint for streaming queries
gengliangwang commented on pull request #33749: URL: https://github.com/apache/spark/pull/33749#issuecomment-901630319 I will cut 3.2.0 RC1 after this one is merged. cc @viirya
[GitHub] [spark] itholic commented on a change in pull request #33752: [SPARK-36401][PYTHON] Implement Series.cov
itholic commented on a change in pull request #33752: URL: https://github.com/apache/spark/pull/33752#discussion_r691803163

## File path: python/pyspark/pandas/series.py
## @@ -944,6 +944,57 @@ def between(self, left: Any, right: Any, inclusive: bool = True) -> "Series":
         return lmask & rmask

+    def cov(self, other: "Series", min_periods: int = 1) -> float:
+        """
+        Compute covariance with Series, excluding missing values.
+
+        Parameters
+        ----------
+        other : Series
+            Series with which to compute the covariance.
+        min_periods : int, default 1
+            Minimum number of observations needed to have a valid result. None = 1.
+
+        Returns
+        -------
+        float
+            Covariance between Series and other.
+
+        Examples
+        --------
+        >>> from pyspark.pandas.config import set_option, reset_option
+        >>> set_option("compute.ops_on_diff_frames", True)
+        >>> s1 = ps.Series([0.90010907, 0.13484424, 0.62036035])
+        >>> s2 = ps.Series([0.12528585, 0.26962463, 0.51111198])
+        >>> s1.cov(s2)
+        -0.016857626527158744
+        >>> reset_option("compute.ops_on_diff_frames")
+        """
+        if min_periods is None:
+            min_periods = 1
+
+        if same_anchor(self, other):
+            self_column_label = verify_temp_column_name(other.to_frame(), "__self_column__")
+            other_column_label = verify_temp_column_name(self.to_frame(), "__other_column__")
+            combined = DataFrame(
+                self._internal.with_new_columns(
+                    [self.rename(self_column_label), other.rename(other_column_label)]
+                )
+            )

Review comment: AFAIK, `count` also collects the data on each node and accumulates the results. Generally `head` is faster when you don't need to scan the entire dataset; `count` always requires scanning all the data.
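The `cov` under review follows pandas semantics: covariance over pairwise-complete observations (pairs with a NaN on either side are dropped), a `min_periods` floor, and the sample `n - 1` divisor. As an illustration of those semantics only, here is a hypothetical plain-Python stand-in (the `cov` helper below is not the pyspark.pandas implementation being reviewed):

```python
import math

def cov(xs, ys, min_periods=1):
    """Sample covariance over pairwise-complete observations.

    Pairs with a NaN on either side are dropped; the result is NaN when
    fewer than min_periods (or fewer than two) valid pairs remain.
    """
    pairs = [(x, y) for x, y in zip(xs, ys)
             if not (math.isnan(x) or math.isnan(y))]
    n = len(pairs)
    if n < max(min_periods or 1, 2):
        return float("nan")
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    return sum((x - mx) * (y - my) for x, y in pairs) / (n - 1)

# The values from the docstring example give approximately -0.0168576:
result = cov([0.90010907, 0.13484424, 0.62036035],
             [0.12528585, 0.26962463, 0.51111198])
```

This also makes the reviewers' point concrete: the covariance itself needs every pair (a full scan, like `count`), whereas an emptiness check only needs to find one valid pair (like `head`).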
[GitHub] [spark] SparkQA commented on pull request #33744: [SPARK-36403][PYTHON] Implement `Index.putmask`
SparkQA commented on pull request #33744: URL: https://github.com/apache/spark/pull/33744#issuecomment-901628569 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47137/
[GitHub] [spark] HyukjinKwon commented on pull request #33769: [SPARK-36536][SQL] Use CAST for datetime in CSV/JSON by default
HyukjinKwon commented on pull request #33769: URL: https://github.com/apache/spark/pull/33769#issuecomment-901627463 Should we maybe update the SQL migration guide?
[GitHub] [spark] HyukjinKwon commented on pull request #33784: Revert "[SPARK-34309][BUILD][CORE][SQL][K8S] Use Caffeine instead of Guava Cache"
HyukjinKwon commented on pull request #33784: URL: https://github.com/apache/spark/pull/33784#issuecomment-901625868 Hm, I was thinking we'd keep that fix since it's already merged.
[GitHub] [spark] SparkQA commented on pull request #33650: [SPARK-36351][SQL] Refactor filter push down in file source v2
SparkQA commented on pull request #33650: URL: https://github.com/apache/spark/pull/33650#issuecomment-901624380 **[Test build #142638 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142638/testReport)** for PR 33650 at commit [`8907dea`](https://github.com/apache/spark/commit/8907deaf39e5b0c954f6ad4750a0c63cf5fc72e7).
[GitHub] [spark] SparkQA commented on pull request #33784: Revert "[SPARK-34309][BUILD][CORE][SQL][K8S] Use Caffeine instead of Guava Cache"
SparkQA commented on pull request #33784: URL: https://github.com/apache/spark/pull/33784#issuecomment-901624206 **[Test build #142637 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142637/testReport)** for PR 33784 at commit [`756c905`](https://github.com/apache/spark/commit/756c905dd2d0d6bba2acb8e468320dda90fd73c3).
[GitHub] [spark] sunchao commented on a change in pull request #33650: [SPARK-36351][SQL] Refactor filter push down in file source v2
sunchao commented on a change in pull request #33650: URL: https://github.com/apache/spark/pull/33650#discussion_r691797436

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileScanBuilder.scala
## @@ -57,6 +64,27 @@ abstract class FileScanBuilder(
     StructType(fields)
   }

+  def pushFilters(partitionFilters: Seq[Expression], dataFilters: Seq[Expression]): Unit = {
+    this.partitionFilters = partitionFilters
+    this.dataFilters = dataFilters
+    this.pushedDataFilters = pushDataFilters(dataFilters)

Review comment: I feel the same way, and this would still allow file data sources to implement `SupportsPushDownFilters`, correct? It feels weird that built-in data sources don't use the V2 API.
[GitHub] [spark] LuciferYang commented on a change in pull request #30483: [SPARK-33449][SQL] Support File Metadata Cache for Parquet
LuciferYang commented on a change in pull request #30483: URL: https://github.com/apache/spark/pull/30483#discussion_r691796638

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
## @@ -967,6 +967,20 @@ object SQLConf {
     .booleanConf
     .createWithDefault(false)

+  val FILE_META_CACHE_PARQUET_ENABLED = buildConf("spark.sql.fileMetaCache.parquet.enabled")
+    .doc("To indicate if enable parquet file meta cache, it is recommended to enabled " +
+      "this config when multiple queries are performed on the same dataset, default is false.")
+    .version("3.3.0")
+    .booleanConf
+    .createWithDefault(false)
+
+  val FILE_META_CACHE_TTL_SINCE_LAST_ACCESS =

Review comment: Good suggestion.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #33650: [SPARK-36351][SQL] Refactor filter push down in file source v2
AmplabJenkins removed a comment on pull request #33650: URL: https://github.com/apache/spark/pull/33650#issuecomment-901622894
[GitHub] [spark] AmplabJenkins removed a comment on pull request #33664: [SPARK-36444][SQL] Remove OptimizeSubqueries from batch of PartitionPruning
AmplabJenkins removed a comment on pull request #33664: URL: https://github.com/apache/spark/pull/33664#issuecomment-901622891 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47133/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #33744: [SPARK-36403][PYTHON] Implement `Index.putmask`
AmplabJenkins removed a comment on pull request #33744: URL: https://github.com/apache/spark/pull/33744#issuecomment-901622889 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/142636/
[GitHub] [spark] AmplabJenkins commented on pull request #33744: [SPARK-36403][PYTHON] Implement `Index.putmask`
AmplabJenkins commented on pull request #33744: URL: https://github.com/apache/spark/pull/33744#issuecomment-901622889 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/142636/
[GitHub] [spark] AmplabJenkins commented on pull request #33664: [SPARK-36444][SQL] Remove OptimizeSubqueries from batch of PartitionPruning
AmplabJenkins commented on pull request #33664: URL: https://github.com/apache/spark/pull/33664#issuecomment-901622891 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47133/
[GitHub] [spark] AmplabJenkins commented on pull request #33650: [SPARK-36351][SQL] Refactor filter push down in file source v2
AmplabJenkins commented on pull request #33650: URL: https://github.com/apache/spark/pull/33650#issuecomment-901622897
[GitHub] [spark] SparkQA commented on pull request #33748: [SPARK-36516][SQL] Support File Metadata Cache for ORC
SparkQA commented on pull request #33748: URL: https://github.com/apache/spark/pull/33748#issuecomment-901622111 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47136/
[GitHub] [spark] LuciferYang commented on pull request #33784: Revert "[SPARK-34309][BUILD][CORE][SQL][K8S] Use Caffeine instead of Guava Cache"
LuciferYang commented on pull request #33784: URL: https://github.com/apache/spark/pull/33784#issuecomment-901621745 cc @gatorsmile @HyukjinKwon @zsxwing @sarutak @sunchao @holdenk @mridulm @dongjoon-hyun. `RemoteBlockPushResolver` had some conflicts, which I resolved manually. Please help review.
[GitHub] [spark] gengliangwang closed pull request #33741: [SPARK-36512][UI][TESTS] Fix UISeleniumSuite in sql/hive-thriftserver
gengliangwang closed pull request #33741: URL: https://github.com/apache/spark/pull/33741
[GitHub] [spark] gengliangwang commented on pull request #33741: [SPARK-36512][UI][TESTS] Fix UISeleniumSuite in sql/hive-thriftserver
gengliangwang commented on pull request #33741: URL: https://github.com/apache/spark/pull/33741#issuecomment-901621425 Thanks, merging to master.
[GitHub] [spark] LuciferYang opened a new pull request #33784: Revert "[SPARK-34309][BUILD][CORE][SQL][K8S] Use Caffeine instead of Guava Cache"
LuciferYang opened a new pull request #33784: URL: https://github.com/apache/spark/pull/33784

### What changes were proposed in this pull request?
This PR reverts the changes of SPARK-34309, including:
- https://github.com/apache/spark/pull/31517
- https://github.com/apache/spark/pull/33772

### Why are the changes needed?
1. No real performance improvement in Spark.
2. It added an additional dependency.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass the Jenkins or GitHub Actions build.
[GitHub] [spark] SparkQA removed a comment on pull request #33650: [SPARK-36351][SQL] Refactor filter push down in file source v2
SparkQA removed a comment on pull request #33650: URL: https://github.com/apache/spark/pull/33650#issuecomment-901519101 **[Test build #142630 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142630/testReport)** for PR 33650 at commit [`2cadefb`](https://github.com/apache/spark/commit/2cadefbd4a42e73578634ecb3c04ebe828f48530).
[GitHub] [spark] SparkQA removed a comment on pull request #33744: [SPARK-36403][PYTHON] Implement `Index.putmask`
SparkQA removed a comment on pull request #33744: URL: https://github.com/apache/spark/pull/33744#issuecomment-901604941 **[Test build #142636 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142636/testReport)** for PR 33744 at commit [`ac455de`](https://github.com/apache/spark/commit/ac455de560e4e7be1a472abbc0aa1a9907cbdd1a).
[GitHub] [spark] SparkQA commented on pull request #33650: [SPARK-36351][SQL] Refactor filter push down in file source v2
SparkQA commented on pull request #33650: URL: https://github.com/apache/spark/pull/33650#issuecomment-901619291 Kubernetes integration test unable to build dist. Exiting with code: 1 URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47135/
[GitHub] [spark] SparkQA commented on pull request #33748: [SPARK-36516][SQL] Support File Metadata Cache for ORC
SparkQA commented on pull request #33748: URL: https://github.com/apache/spark/pull/33748#issuecomment-901618376 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47134/
[GitHub] [spark] SparkQA commented on pull request #33744: [SPARK-36403][PYTHON] Implement `Index.putmask`
SparkQA commented on pull request #33744: URL: https://github.com/apache/spark/pull/33744#issuecomment-901614541 **[Test build #142636 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142636/testReport)** for PR 33744 at commit [`ac455de`](https://github.com/apache/spark/commit/ac455de560e4e7be1a472abbc0aa1a9907cbdd1a). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] [spark] SparkQA commented on pull request #33664: [SPARK-36444][SQL] Remove OptimizeSubqueries from batch of PartitionPruning
SparkQA commented on pull request #33664: URL: https://github.com/apache/spark/pull/33664#issuecomment-901612824 Kubernetes integration test status success URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47133/
[GitHub] [spark] dgd-contributor commented on pull request #33536: [SPARK-36101][CORE] Grouping exception in core/api
dgd-contributor commented on pull request #33536: URL: https://github.com/apache/spark/pull/33536#issuecomment-901612747 Hi @cloud-fan, can you review this?
[GitHub] [spark] SparkQA commented on pull request #33650: [SPARK-36351][SQL] Refactor filter push down in file source v2
SparkQA commented on pull request #33650: URL: https://github.com/apache/spark/pull/33650#issuecomment-901611530 **[Test build #142630 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142630/testReport)** for PR 33650 at commit [`2cadefb`](https://github.com/apache/spark/commit/2cadefbd4a42e73578634ecb3c04ebe828f48530). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] [spark] sarutak commented on pull request #33783: [MINOR][DOCS] Mention Hadoop 3 in YARN introduction on cluster-overview.md
sarutak commented on pull request #33783: URL: https://github.com/apache/spark/pull/33783#issuecomment-901609919 Are there any similar places we should modify in bulk?
[GitHub] [spark] cloud-fan commented on a change in pull request #33650: [SPARK-36351][SQL] Refactor filter push down in file source v2
cloud-fan commented on a change in pull request #33650: URL: https://github.com/apache/spark/pull/33650#discussion_r691778707

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceUtils.scala
## @@ -242,4 +244,21 @@ object DataSourceUtils {
       options
     }
   }
+
+  def getPartitionKeyFiltersAndDataFilters(
+      sparkSession: SparkSession,

Review comment: It seems we only need `conf: SQLConf`.
[GitHub] [spark] cloud-fan commented on a change in pull request #33650: [SPARK-36351][SQL] Refactor filter push down in file source v2
cloud-fan commented on a change in pull request #33650: URL: https://github.com/apache/spark/pull/33650#discussion_r691778075

## File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala
## @@ -2972,37 +2970,6 @@ class JsonV2Suite extends JsonSuite {
     super
       .sparkConf
       .set(SQLConf.USE_V1_SOURCE_LIST, "")
-
-  test("get pushed filters") {

Review comment: Can we rewrite this test and make it work?
[GitHub] [spark] cloud-fan commented on a change in pull request #33650: [SPARK-36351][SQL] Refactor filter push down in file source v2
cloud-fan commented on a change in pull request #33650: URL: https://github.com/apache/spark/pull/33650#discussion_r691777367

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileScanBuilder.scala
## @@ -57,6 +64,27 @@ abstract class FileScanBuilder(
     StructType(fields)
   }

+  def pushFilters(partitionFilters: Seq[Expression], dataFilters: Seq[Expression]): Unit = {
+    this.partitionFilters = partitionFilters
+    this.dataFilters = dataFilters
+    this.pushedDataFilters = pushDataFilters(dataFilters)

Review comment: We can translate the data filters here before passing them to `pushDataFilters`.
[GitHub] [spark] cloud-fan commented on a change in pull request #33650: [SPARK-36351][SQL] Refactor filter push down in file source v2
cloud-fan commented on a change in pull request #33650: URL: https://github.com/apache/spark/pull/33650#discussion_r691776887

## File path: external/avro/src/main/scala/org/apache/spark/sql/v2/avro/AvroScanBuilder.scala
## @@ -41,17 +42,16 @@ class AvroScanBuilder (
       readDataSchema(),
       readPartitionSchema(),
       options,
-      pushedFilters())
+      pushedDataFilters,
+      partitionFilters,
+      dataFilters)
   }

-  private var _pushedFilters: Array[Filter] = Array.empty
-
-  override def pushFilters(filters: Array[Filter]): Array[Filter] = {
+  override def pushDataFilters(dataFilters: Seq[Expression]): Array[Filter] = {

Review comment: Nit: the input parameter can be `dataFilters: Array[Filter]`; then we don't need to ask every source implementation to call `translateDataFilter`.
[GitHub] [spark] SparkQA commented on pull request #33744: [SPARK-36403][PYTHON] Implement `Index.putmask`
SparkQA commented on pull request #33744: URL: https://github.com/apache/spark/pull/33744#issuecomment-901604941 **[Test build #142636 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142636/testReport)** for PR 33744 at commit [`ac455de`](https://github.com/apache/spark/commit/ac455de560e4e7be1a472abbc0aa1a9907cbdd1a).
[GitHub] [spark] SparkQA commented on pull request #33748: [SPARK-36516][SQL] Support File Metadata Cache for ORC
SparkQA commented on pull request #33748: URL: https://github.com/apache/spark/pull/33748#issuecomment-901603993 **[Test build #142634 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142634/testReport)** for PR 33748 at commit [`4adeb62`](https://github.com/apache/spark/commit/4adeb628062bfc091041df845d1c2b9bd7515954).
[GitHub] [spark] sunchao commented on a change in pull request #30483: [SPARK-33449][SQL] Support File Metadata Cache for Parquet
sunchao commented on a change in pull request #30483: URL: https://github.com/apache/spark/pull/30483#discussion_r691697807

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

```diff
@@ -967,6 +967,20 @@ object SQLConf {
     .booleanConf
     .createWithDefault(false)
 
+  val FILE_META_CACHE_PARQUET_ENABLED = buildConf("spark.sql.fileMetaCache.parquet.enabled")
+    .doc("To indicate if enable parquet file meta cache, it is recommended to enabled " +
```

Review comment: hmm, curious whether this can help if your Spark queries are running as separate Spark jobs, where each of them may use different executors.

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

```diff
@@ -967,6 +967,20 @@ object SQLConf {
     .booleanConf
     .createWithDefault(false)
 
+  val FILE_META_CACHE_PARQUET_ENABLED = buildConf("spark.sql.fileMetaCache.parquet.enabled")
+    .doc("To indicate if enable parquet file meta cache, it is recommended to enabled " +
+      "this config when multiple queries are performed on the same dataset, default is false.")
+    .version("3.3.0")
+    .booleanConf
+    .createWithDefault(false)
+
+  val FILE_META_CACHE_TTL_SINCE_LAST_ACCESS =
```

Review comment: nit: maybe `FILE_META_CACHE_TTL_SINCE_LAST_ACCESS_SEC` and `spark.sql.fileMetaCache.ttlSinceLastAccessSec` so it's easier to know that the unit is second?
## File path: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java ## @@ -77,28 +82,31 @@ protected ParquetFileReader reader; + protected ParquetMetadata cachedFooter; + @Override public void initialize(InputSplit inputSplit, TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException { Configuration configuration = taskAttemptContext.getConfiguration(); FileSplit split = (FileSplit) inputSplit; this.file = split.getPath(); -ParquetReadOptions options = HadoopReadOptions - .builder(configuration) - .withRange(split.getStart(), split.getStart() + split.getLength()) - .build(); -this.reader = new ParquetFileReader(HadoopInputFile.fromPath(file, configuration), options); -this.fileSchema = reader.getFileMetaData().getSchema(); -Map fileMetadata = reader.getFileMetaData().getKeyValueMetaData(); +ParquetMetadata footer = + readFooterByRange(configuration, split.getStart(), split.getStart() + split.getLength()); +this.fileSchema = footer.getFileMetaData().getSchema(); +FilterCompat.Filter filter = ParquetInputFormat.getFilter(configuration); +List blocks = + RowGroupFilter.filterRowGroups(filter, footer.getBlocks(), fileSchema); Review comment: does this apply all the filter levels? e.g., stats, dictionary, and bloom filter. ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileMetaCacheManager.scala ## @@ -0,0 +1,87 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. 
You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources + +import java.util.concurrent.TimeUnit + +import com.github.benmanes.caffeine.cache.{CacheLoader, Caffeine} +import com.github.benmanes.caffeine.cache.stats.CacheStats +import org.apache.hadoop.conf.Configuration +import org.apache.hadoop.fs.Path + +import org.apache.spark.SparkEnv +import org.apache.spark.internal.Logging +import org.apache.spark.sql.internal.SQLConf + +/** + * A singleton Cache Manager to caching file meta. We cache these file metas in order to speed up + * iterated queries over the same dataset. Otherwise, each query would have to hit remote storage + * in order to fetch file meta before read files. + * + * We should implement the corresponding `FileMetaKey` for a specific file format, for example + * `ParquetFileMetaKey` or `OrcFileMetaKey`. By default, the file path is used as the identification + * of the `FileMetaKey` and the `getFileMeta` method of `FileMetaKey` is used to return the file + * meta of the corresponding file format. + */ +object FileMetaCacheManager extends Logging { + + private lazy val cacheLoader = new
[GitHub] [spark] SparkQA commented on pull request #33650: [SPARK-36351][SQL] Refactor filter push down in file source v2
SparkQA commented on pull request #33650: URL: https://github.com/apache/spark/pull/33650#issuecomment-901603315 **[Test build #142635 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142635/testReport)** for PR 33650 at commit [`ae98a11`](https://github.com/apache/spark/commit/ae98aaa6a0f14ce86faf313779728f51ecab).
[GitHub] [spark] huaxingao commented on a change in pull request #33650: [SPARK-36351][SQL] Refactor filter push down in file source v2
huaxingao commented on a change in pull request #33650: URL: https://github.com/apache/spark/pull/33650#discussion_r691773116 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileScanBuilder.scala ## @@ -57,6 +63,30 @@ abstract class FileScanBuilder( StructType(fields) } + def pushFiltersToFileIndex( Review comment: Sounds good. Changed.
[GitHub] [spark] SparkQA commented on pull request #33664: [SPARK-36444][SQL] Remove OptimizeSubqueries from batch of PartitionPruning
SparkQA commented on pull request #33664: URL: https://github.com/apache/spark/pull/33664#issuecomment-901600120 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47133/
[GitHub] [spark] LuciferYang commented on a change in pull request #33748: [SPARK-36516][SQL] Support File Metadata Cache for ORC
LuciferYang commented on a change in pull request #33748: URL: https://github.com/apache/spark/pull/33748#discussion_r691768337 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileMetaCacheManager.scala ## @@ -0,0 +1,95 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources + +import java.util.concurrent.TimeUnit + +import com.github.benmanes.caffeine.cache.{CacheLoader, Caffeine} +import com.github.benmanes.caffeine.cache.stats.CacheStats +import org.apache.hadoop.conf.Configuration +import org.apache.hadoop.fs.Path + +import org.apache.spark.SparkEnv +import org.apache.spark.internal.Logging +import org.apache.spark.sql.internal.SQLConf + +/** + * A singleton Cache Manager to caching file meta. We cache these file metas in order to speed up + * iterated queries over the same dataset. Otherwise, each query would have to hit remote storage + * in order to fetch file meta before read files. + * + * We should implement the corresponding `FileMetaKey` for a specific file format, for example + * `ParquetFileMetaKey` or `OrcFileMetaKey`. 
By default, the file path is used as the identification + * of the `FileMetaKey` and the `getFileMeta` method of `FileMetaKey` is used to return the file + * meta of the corresponding file format. + */ +object FileMetaCacheManager extends Logging { + + private lazy val cacheLoader = new CacheLoader[FileMetaKey, FileMeta]() { Review comment: 59d5bb9 changed this to use the Guava cache and updated the benchmark results
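For context on the review thread above, the cache being discussed is an expire-after-last-access loading cache keyed by file path. A minimal plain-Python sketch of those semantics follows — the actual PR uses Guava's `CacheLoader`/`CacheBuilder` in Scala, so every name and number below is purely illustrative, not Spark code:

```python
import time

class TtlCache:
    """Expire-after-last-access loading cache (hypothetical sketch).

    Mirrors the FileMetaCacheManager idea: load file metadata once per key,
    serve repeat lookups from memory, and drop entries that sit idle longer
    than the TTL so stale metadata does not accumulate.
    """

    def __init__(self, loader, ttl_seconds):
        self._loader = loader          # key -> value, invoked on a cache miss
        self._ttl = ttl_seconds        # idle time after which an entry expires
        self._entries = {}             # key -> (value, last_access_time)

    def get(self, key):
        now = time.monotonic()
        # Evict entries that have been idle longer than the TTL.
        for k in [k for k, (_, t) in self._entries.items() if now - t > self._ttl]:
            del self._entries[k]
        if key not in self._entries:
            self._entries[key] = (self._loader(key), now)   # miss: load once
        value, _ = self._entries[key]
        self._entries[key] = (value, now)                   # refresh last access
        return value


loads = []
cache = TtlCache(loader=lambda path: loads.append(path) or f"meta({path})",
                 ttl_seconds=60)
cache.get("part-0.orc")
cache.get("part-0.orc")   # second lookup is served from cache; loader ran once
```

The trade-off the reviewers are weighing (Caffeine vs. Guava, TTL naming) does not change these semantics, only the backing implementation.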
[GitHub] [spark] cloud-fan commented on pull request #33664: [SPARK-36444][SQL] Remove OptimizeSubqueries from batch of PartitionPruning
cloud-fan commented on pull request #33664: URL: https://github.com/apache/spark/pull/33664#issuecomment-901589931 https://github.com/apache/spark/commit/a7a3935c97d1fe6060cae42bbc9229c087b648ab#diff-5221c65a64ad82c34cae68169cdb389210a9a28145058ae995b46ff4d3d4964cR39 We put this `OptimizeSubqueries` rule together with the DPP rule at the very beginning. It's kind of a mistake, as once this rule applies, we break plan reuse and thus break DPP. This PR LGTM
[GitHub] [spark] sumeetgajjar edited a comment on pull request #33782: [SPARK-35011][CORE][3.0] Avoid Block Manager registrations when StopExecutor msg is in-flight
sumeetgajjar edited a comment on pull request #33782: URL: https://github.com/apache/spark/pull/33782#issuecomment-901588979 The GitHub check failed due to an unrelated UT failure:
```
[info] - multiple joins *** FAILED *** (1 second, 104 milliseconds)
[info] ArrayBuffer(BroadcastHashJoin [b#147460], [a#147469], Inner, BuildLeft
```
I do not have permission to re-run the checks; could someone please re-run them? Edit: I ran the same UT locally, it passed without any issues. :)
[GitHub] [spark] dgd-contributor commented on a change in pull request #33752: [SPARK-36401][PYTHON] Implement Series.cov
dgd-contributor commented on a change in pull request #33752: URL: https://github.com/apache/spark/pull/33752#discussion_r691758520 ## File path: python/pyspark/pandas/series.py ## @@ -944,6 +944,57 @@ def between(self, left: Any, right: Any, inclusive: bool = True) -> "Series": return lmask & rmask +def cov(self, other: "Series", min_periods: int = 1) -> float: +""" +Compute covariance with Series, excluding missing values. +Parameters +-- +other : Series +Series with which to compute the covariance. +min_periods : int, default 1 +Minimum number of observations needed to have a valid result. None = 1. + +Returns +--- +float +Covariance between Series and other + +Examples + +>>> from pyspark.pandas.config import set_option, reset_option +>>> set_option("compute.ops_on_diff_frames", True) +>>> s1 = ps.Series([0.90010907, 0.13484424, 0.62036035]) +>>> s2 = ps.Series([0.12528585, 0.26962463, 0.5198]) +>>> s1.cov(s2) +-0.016857626527158744 +>>> reset_option("compute.ops_on_diff_frames") +""" + +if min_periods is None: +min_periods = 1 + +if same_anchor(self, other): +self_column_label = verify_temp_column_name(other.to_frame(), "__self_column__") +other_column_label = verify_temp_column_name(self.to_frame(), "__other_column__") +combined = DataFrame( +self._internal.with_new_columns( +[self.rename(self_column_label), other.rename(other_column_label)] +) +) Review comment: I think sdf.count() may be better than len(sdf.head(min_periods)) because it does not collect data to the driver.
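The thread above is only debating how to enforce `min_periods` (a full `count()` scan vs. an early-exit `head(min_periods)`). For reference, the covariance semantics the PR targets — pairwise-complete observations, sample covariance, NaN when too few valid pairs remain — can be sketched in plain Python. This is an illustrative sketch, not the pandas-on-Spark implementation, and the function name is made up:

```python
import math

def pairwise_cov(xs, ys, min_periods=1):
    """Sample covariance over pairwise-complete observations (sketch)."""
    # Drop any pair where either side is missing, as pandas does.
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < max(min_periods, 2):
        return float("nan")   # too few valid observations for a result
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    # Sample covariance (n - 1 denominator), matching pandas' default.
    return sum((x - mean_x) * (y - mean_y) for x, y in pairs) / (n - 1)


# One missing value drops the whole pair: only 3 of 4 pairs count here.
print(pairwise_cov([1.0, 2.0, None, 4.0], [4.0, 3.0, 2.0, 1.0]))
```

With `min_periods=4` the same inputs would return NaN, since only 3 complete pairs survive — which is exactly why the implementation needs a cheap "are there at least `min_periods` rows" check.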
[GitHub] [spark] AmplabJenkins removed a comment on pull request #33673: [SPARK-36448][SQL] Exceptions in NoSuchItemException.scala have to be case classes
AmplabJenkins removed a comment on pull request #33673: URL: https://github.com/apache/spark/pull/33673#issuecomment-900866537 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/142587/
[GitHub] [spark] cloud-fan commented on a change in pull request #33673: [SPARK-36448][SQL] Exceptions in NoSuchItemException.scala have to be case classes
cloud-fan commented on a change in pull request #33673: URL: https://github.com/apache/spark/pull/33673#discussion_r691756821

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/NoSuchItemException.scala

```diff
@@ -29,18 +29,24 @@ import org.apache.spark.sql.types.StructType
  * Thrown by a catalog when an item cannot be found. The analyzer will rethrow the exception
  * as an [[org.apache.spark.sql.AnalysisException]] with the correct position information.
  */
-class NoSuchDatabaseException(
-    val db: String) extends NoSuchNamespaceException(s"Database '$db' not found")
+case class NoSuchDatabaseException(db: String)
+  extends AnalysisException(s"Database '$db' not found")
```

Review comment: I can't think of a better way. AFAIK it's an ill pattern to extend a case class in Scala.
[GitHub] [spark] cloud-fan closed pull request #33736: [SPARK-35991][SQL] Add PlanStability suite for TPCH
cloud-fan closed pull request #33736: URL: https://github.com/apache/spark/pull/33736
[GitHub] [spark] cloud-fan commented on pull request #33736: [SPARK-35991][SQL] Add PlanStability suite for TPCH
cloud-fan commented on pull request #33736: URL: https://github.com/apache/spark/pull/33736#issuecomment-901584521 thanks, merging to master!
[GitHub] [spark] AmplabJenkins removed a comment on pull request #33599: [SPARK-36371][SQL] Support raw string literal
AmplabJenkins removed a comment on pull request #33599: URL: https://github.com/apache/spark/pull/33599#issuecomment-901570913 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47131/
[GitHub] [spark] cloud-fan closed pull request #33599: [SPARK-36371][SQL] Support raw string literal
cloud-fan closed pull request #33599: URL: https://github.com/apache/spark/pull/33599
[GitHub] [spark] cloud-fan commented on pull request #33599: [SPARK-36371][SQL] Support raw string literal
cloud-fan commented on pull request #33599: URL: https://github.com/apache/spark/pull/33599#issuecomment-901583747 thanks, merging to master!
[GitHub] [spark] SparkQA commented on pull request #33664: [SPARK-36444][SQL] Remove OptimizeSubqueries from batch of PartitionPruning
SparkQA commented on pull request #33664: URL: https://github.com/apache/spark/pull/33664#issuecomment-901582918 **[Test build #142633 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142633/testReport)** for PR 33664 at commit [`0d7e228`](https://github.com/apache/spark/commit/0d7e228b42b92abf1ce15681a2b95361dac4).
[GitHub] [spark] AmplabJenkins commented on pull request #33783: [MINOR][DOCS] Mention Hadoop 3 in YARN introduction on cluster-overview.md
AmplabJenkins commented on pull request #33783: URL: https://github.com/apache/spark/pull/33783#issuecomment-901582593 Can one of the admins verify this patch?
[GitHub] [spark] AmplabJenkins removed a comment on pull request #33748: [SPARK-36516][SQL] Support File Metadata Cache for ORC
AmplabJenkins removed a comment on pull request #33748: URL: https://github.com/apache/spark/pull/33748#issuecomment-901581916 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47132/
[GitHub] [spark] AmplabJenkins commented on pull request #33748: [SPARK-36516][SQL] Support File Metadata Cache for ORC
AmplabJenkins commented on pull request #33748: URL: https://github.com/apache/spark/pull/33748#issuecomment-901581916 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47132/
[GitHub] [spark] yutoacts commented on a change in pull request #33777: [SPARK-36538][DOCS] Fix the environment variables part in configuration.md
yutoacts commented on a change in pull request #33777: URL: https://github.com/apache/spark/pull/33777#discussion_r691749052 ## File path: docs/configuration.md ## @@ -3075,7 +3075,7 @@ to use on each machine and maximum memory. Since `spark-env.sh` is a shell script, some of these can be set programmatically -- for example, you might compute `SPARK_LOCAL_IP` by looking up the IP of a specific network interface. -Note: When running Spark on YARN in `cluster` mode, environment variables need to be set using the `spark.yarn.appMasterEnv.[EnvironmentVariableName]` property in your `conf/spark-defaults.conf` file. Environment variables that are set in `spark-env.sh` will not be reflected in the YARN Application Master process in `cluster` mode. See the [YARN-related Spark Properties](running-on-yarn.html#spark-properties) for more information. Review comment: I think I totally misunderstood what it says. Thank you for the correction.
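To make the documentation note above concrete — a hedged illustration, with the variable name and value chosen as examples rather than taken from the PR: in YARN `cluster` mode, an Application Master environment variable is set through the documented `spark.yarn.appMasterEnv.[EnvironmentVariableName]` property in `conf/spark-defaults.conf`, because variables exported in `spark-env.sh` are not reflected in the Application Master process in that mode.

```properties
# conf/spark-defaults.conf -- reaches the YARN Application Master in cluster mode,
# unlike an export in spark-env.sh
spark.yarn.appMasterEnv.PYSPARK_PYTHON  /usr/bin/python3
```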
[GitHub] [spark] yutoacts closed pull request #33777: [SPARK-36538][DOCS] Fix the environment variables part in configuration.md
yutoacts closed pull request #33777: URL: https://github.com/apache/spark/pull/33777
[GitHub] [spark] SparkQA commented on pull request #33748: [SPARK-36516][SQL] Support File Metadata Cache for ORC
SparkQA commented on pull request #33748: URL: https://github.com/apache/spark/pull/33748#issuecomment-901578737 Kubernetes integration test unable to build dist. exiting with code: 1 URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47132/
[GitHub] [spark] yutoacts commented on pull request #33568: [SPARK-36335][DOCS] Remove Local-cluster mode reference (and add a missing period)
yutoacts commented on pull request #33568: URL: https://github.com/apache/spark/pull/33568#issuecomment-901576683 It ended up as https://github.com/apache/spark/pull/33537.
[GitHub] [spark] yutoacts closed pull request #33568: [SPARK-36335][DOCS] Remove Local-cluster mode reference (and add a missing period)
yutoacts closed pull request #33568: URL: https://github.com/apache/spark/pull/33568
[GitHub] [spark] yutoacts opened a new pull request #33783: [MINOR][DOCS] Mention Hadoop 3 in YARN introduction on cluster-overview.md
yutoacts opened a new pull request #33783: URL: https://github.com/apache/spark/pull/33783

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
[GitHub] [spark] dgd-contributor commented on a change in pull request #33752: [SPARK-36401][PYTHON] Implement Series.cov
dgd-contributor commented on a change in pull request #33752: URL: https://github.com/apache/spark/pull/33752#discussion_r691744731 ## File path: python/pyspark/pandas/series.py ## Review comment: Thank you so much. Done.
[GitHub] [spark] wangyum commented on pull request #33664: [SPARK-36444][SQL] Remove OptimizeSubqueries from batch of PartitionPruning
wangyum commented on pull request #33664: URL: https://github.com/apache/spark/pull/33664#issuecomment-901573272 retest this please.
[GitHub] [spark] cloud-fan closed pull request #33781: [SPARK-33687][SQL][DOC][FOLLOWUP] Merge the doc pages of ANALYZE TABLE and ANALYZE TABLES
cloud-fan closed pull request #33781: URL: https://github.com/apache/spark/pull/33781
[GitHub] [spark] cloud-fan commented on pull request #33781: [SPARK-33687][SQL][DOC][FOLLOWUP] Merge the doc pages of ANALYZE TABLE and ANALYZE TABLES
cloud-fan commented on pull request #33781: URL: https://github.com/apache/spark/pull/33781#issuecomment-901572333 thanks for the review, merging to master/3.2!
[GitHub] [spark] cloud-fan commented on a change in pull request #33650: [SPARK-36351][SQL] Refactor filter push down in file source v2
cloud-fan commented on a change in pull request #33650: URL: https://github.com/apache/spark/pull/33650#discussion_r691740952 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileScanBuilder.scala ## @@ -57,6 +63,30 @@ abstract class FileScanBuilder( StructType(fields) } + def pushFiltersToFileIndex( Review comment: this pushes data filters to the underlying file format as well. How about ``` protected var partitionFilters = Seq.empty[Expression] protected var dataFilters = Seq.empty[Expression] protected var pushedDataFilters = Seq.empty[Filter] ... def pushFilters(partitionFilters: Seq[Expression], dataFilters: Seq[Expression]): Unit = { this.partitionFilters = partitionFilters this.dataFilters = dataFilters this.pushedDataFilters = pushDataFilters(dataFilters) } protected def pushDataFilters(dataFilters: Seq[Expression]) = Nil ``` Then file source impl can just override `pushDataFilters`
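The template-method shape of this suggestion can be illustrated outside Scala. The sketch below is a toy Python analogue with hypothetical names and string-valued filters, not Spark's actual API: the base builder records partition and data filters, and each file source overrides only the data-filter pushdown hook.

```python
# Toy sketch of the proposed design: a base scan builder records the
# incoming filters, and concrete file sources override only the hook
# that decides which data filters can actually be pushed down.
class FileScanBuilder:
    def __init__(self):
        self.partition_filters = []
        self.data_filters = []
        self.pushed_data_filters = []

    def push_filters(self, partition_filters, data_filters):
        self.partition_filters = partition_filters
        self.data_filters = data_filters
        self.pushed_data_filters = self.push_data_filters(data_filters)

    def push_data_filters(self, data_filters):
        return []  # default: the format pushes nothing


class OrcScanBuilder(FileScanBuilder):
    def push_data_filters(self, data_filters):
        # A real source would translate filters and keep only the
        # supported ones; here "unsupported" stands in for the rest.
        return [f for f in data_filters if f != "unsupported"]


b = OrcScanBuilder()
b.push_filters(["p = 1"], ["x > 0", "unsupported"])
assert b.pushed_data_filters == ["x > 0"]
assert b.partition_filters == ["p = 1"]
```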
[GitHub] [spark] LuciferYang commented on a change in pull request #33748: [SPARK-36516][SQL] Support File Metadata Cache for ORC
LuciferYang commented on a change in pull request #33748: URL: https://github.com/apache/spark/pull/33748#discussion_r691734091 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileMetaCacheManager.scala ## @@ -0,0 +1,95 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources + +import java.util.concurrent.TimeUnit + +import com.github.benmanes.caffeine.cache.{CacheLoader, Caffeine} +import com.github.benmanes.caffeine.cache.stats.CacheStats +import org.apache.hadoop.conf.Configuration +import org.apache.hadoop.fs.Path + +import org.apache.spark.SparkEnv +import org.apache.spark.internal.Logging +import org.apache.spark.sql.internal.SQLConf + +/** + * A singleton cache manager for caching file metadata. We cache file meta in order to speed up + * repeated queries over the same dataset. Otherwise, each query would have to hit remote storage + * to fetch the file meta before reading files. + * + * We should implement the corresponding `FileMetaKey` for a specific file format, for example + * `ParquetFileMetaKey` or `OrcFileMetaKey`. By default, the file path is used as the identification + * of the `FileMetaKey`, and the `getFileMeta` method of `FileMetaKey` returns the file + * meta of the corresponding file format. + */ +object FileMetaCacheManager extends Logging { + + private lazy val cacheLoader = new CacheLoader[FileMetaKey, FileMeta]() { Review comment: @dongjoon-hyun will switch back to Guava because SPARK-34309 will be reverted; I need to update the benchmark results.
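The caching scheme described in the scaladoc above — a process-wide map from file path to parsed file meta, with time-based eviction — can be sketched as follows. This is an illustrative Python toy with hypothetical names, not the PR's Scala code, which uses a Guava/Caffeine loading cache:

```python
# Illustrative TTL cache keyed by file path (hypothetical sketch of the
# FileMetaCacheManager idea; not the actual Spark implementation).
import time


class FileMetaCacheManager:
    def __init__(self, ttl_seconds=60.0, loader=None):
        self._ttl = ttl_seconds
        self._loader = loader      # called on a miss to parse file meta
        self._entries = {}         # path -> (inserted_at, meta)
        self.hits = 0
        self.misses = 0

    def get(self, path):
        now = time.monotonic()
        entry = self._entries.get(path)
        if entry is not None and now - entry[0] < self._ttl:
            self.hits += 1
            return entry[1]
        # Expired or absent: reload from storage and refresh the entry.
        self.misses += 1
        meta = self._loader(path)
        self._entries[path] = (now, meta)
        return meta


# Usage: repeated queries over the same file hit the cache instead of
# re-fetching the meta from (remote) storage.
cache = FileMetaCacheManager(ttl_seconds=60.0, loader=lambda p: {"path": p})
cache.get("/data/part-0000.orc")
cache.get("/data/part-0000.orc")
assert (cache.hits, cache.misses) == (1, 1)
```

This also makes the caveat from the config doc concrete: if a data file is replaced under the same path, stale meta can be served until the entry expires.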
[GitHub] [spark] AmplabJenkins commented on pull request #33599: [SPARK-36371][SQL] Support raw string literal
AmplabJenkins commented on pull request #33599: URL: https://github.com/apache/spark/pull/33599#issuecomment-901570913 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47131/
[GitHub] [spark] SparkQA commented on pull request #33599: [SPARK-36371][SQL] Support raw string literal
SparkQA commented on pull request #33599: URL: https://github.com/apache/spark/pull/33599#issuecomment-901570893 Kubernetes integration test status success URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47131/
[GitHub] [spark] LuciferYang commented on pull request #33629: [SPARK-36407][CORE][SQL] Convert int to long to avoid potential integer multiplications overflow risk
LuciferYang commented on pull request #33629: URL: https://github.com/apache/spark/pull/33629#issuecomment-901570511 thanks @srowen
[GitHub] [spark] dgd-contributor closed pull request #33779: [SPARK-36302][SQL]: Refactor thirteenth set of 20 query execution errors to use error classes
dgd-contributor closed pull request #33779: URL: https://github.com/apache/spark/pull/33779
[GitHub] [spark] xinrong-databricks commented on a change in pull request #33714: [SPARK-36399][PYTHON] Implement DataFrame.combine_first
xinrong-databricks commented on a change in pull request #33714: URL: https://github.com/apache/spark/pull/33714#discussion_r691736383 ## File path: python/pyspark/pandas/tests/test_dataframe.py ## @@ -5614,6 +5614,40 @@ def test_at_time(self): with self.assertRaisesRegex(TypeError, "Index must be DatetimeIndex"): psdf.at_time("0:15") +def test_combine_first(self): Review comment: Let me take a look then, thanks!
[GitHub] [spark] SparkQA commented on pull request #33748: [SPARK-36516][SQL] Support File Metadata Cache for ORC
SparkQA commented on pull request #33748: URL: https://github.com/apache/spark/pull/33748#issuecomment-901564736 **[Test build #142632 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142632/testReport)** for PR 33748 at commit [`c3838e6`](https://github.com/apache/spark/commit/c3838e68241d5f8409cbcc565815a494e7eb245b).
[GitHub] [spark] huaxingao commented on a change in pull request #33650: [SPARK-36351][SQL] Refactor filter push down in file source v2
huaxingao commented on a change in pull request #33650: URL: https://github.com/apache/spark/pull/33650#discussion_r691733542 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/PushDownUtils.scala ## @@ -38,9 +37,9 @@ object PushDownUtils extends PredicateHelper { * @return pushed filter and post-scan filters. */ def pushFilters( - scanBuilder: ScanBuilder, + scanBuilderHolder: ScanBuilderHolder, Review comment: because I need the `scanBuilderHolder.relation` for `DataSourceUtils.getPartitionKeyFiltersAndDataFilters` ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/PushDownUtils.scala ## @@ -50,8 +49,17 @@ object PushDownUtils extends PredicateHelper { val translatedFilters = mutable.ArrayBuffer.empty[sources.Filter] // Catalyst filter expression that can't be translated to data source filters. val untranslatableExprs = mutable.ArrayBuffer.empty[Expression] +val dataFilters = r match { + case f: FileScanBuilder => +val (partitionFilters, fileDataFilters) = + DataSourceUtils.getPartitionKeyFiltersAndDataFilters( + f.getSparkSession, scanBuilderHolder.relation, f.readPartitionSchema(), filters) +f.pushPartitionFilters(ExpressionSet(partitionFilters).toSeq, fileDataFilters) Review comment: As per our offline discussion, I have made the following changes: - make file source v2 NOT implement `SupportsPushDownFilters` any more - add `pushFiltersToFileIndex` in file source v2. In this method: - push both Expression partition filters and Expression data filters to the file source. - data filters are used for filter push down. The file source translates the data filters from `Expression` to `sources.Filter`, and decides which filters to push down. - partition filters are used for partition pruning. I have updated the PR description accordingly.
[GitHub] [spark] LuciferYang commented on a change in pull request #33748: [SPARK-36516][SQL] Support File Metadata Cache for ORC
LuciferYang commented on a change in pull request #33748: URL: https://github.com/apache/spark/pull/33748#discussion_r691733053 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ## @@ -967,6 +967,32 @@ object SQLConf { .booleanConf .createWithDefault(false) + val FILE_META_CACHE_ENABLED_SOURCE_LIST = buildConf("spark.sql.fileMetaCache.enabledSourceList") +.doc("A comma-separated list of data source short names for which the file meta cache is " + + "enabled. Currently the file meta cache only supports ORC; it is recommended to enable " + + "this config when multiple queries are performed on the same dataset. " + + "Warning: if the fileMetaCache is enabled, the existing data files should not be " + + "replaced with the same file name, otherwise there will be a risk of job failure or wrong " + + "data reading before the cache entry expires.") +.version("3.3.0") +.stringConf Review comment: c3838e6 adds `.checkValue` and a test case
[GitHub] [spark] HeartSaVioR edited a comment on pull request #33749: [SPARK-36519][SS]Store RocksDB format version in the checkpoint for streaming queries
HeartSaVioR edited a comment on pull request #33749: URL: https://github.com/apache/spark/pull/33749#issuecomment-901562572
[GitHub] [spark] HeartSaVioR commented on pull request #33749: [SPARK-36519][SS]Store RocksDB format version in the checkpoint for streaming queries
HeartSaVioR commented on pull request #33749: URL: https://github.com/apache/spark/pull/33749#issuecomment-901562572 I'll merge this early tomorrow if there's no further comment, or @viirya is OK with this. cc. @viirya
[GitHub] [spark] AmplabJenkins removed a comment on pull request #33588: [SPARK-36346][SQL] Support TimestampNTZ type in Orc file source
AmplabJenkins removed a comment on pull request #33588: URL: https://github.com/apache/spark/pull/33588#issuecomment-901136096 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/142588/
[GitHub] [spark] beliefer removed a comment on pull request #33588: [SPARK-36346][SQL] Support TimestampNTZ type in Orc file source
beliefer removed a comment on pull request #33588: URL: https://github.com/apache/spark/pull/33588#issuecomment-900137467 ping @cloud-fan
[GitHub] [spark] beliefer edited a comment on pull request #33588: [SPARK-36346][SQL] Support TimestampNTZ type in Orc file source
beliefer edited a comment on pull request #33588: URL: https://github.com/apache/spark/pull/33588#issuecomment-901186384 ping @gengliangwang @cloud-fan
[GitHub] [spark] AmplabJenkins removed a comment on pull request #33714: [SPARK-36399][PYTHON] Implement DataFrame.combine_first
AmplabJenkins removed a comment on pull request #33714: URL: https://github.com/apache/spark/pull/33714#issuecomment-901561191 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47129/
[GitHub] [spark] AmplabJenkins commented on pull request #33782: [SPARK-35011][CORE][3.0] Avoid Block Manager registrations when StopExecutor msg is in-flight
AmplabJenkins commented on pull request #33782: URL: https://github.com/apache/spark/pull/33782#issuecomment-901561551 Can one of the admins verify this patch?
[GitHub] [spark] AmplabJenkins commented on pull request #33714: [SPARK-36399][PYTHON] Implement DataFrame.combine_first
AmplabJenkins commented on pull request #33714: URL: https://github.com/apache/spark/pull/33714#issuecomment-901561191 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47129/
[GitHub] [spark] sumeetgajjar commented on pull request #33770: [SPARK-34949][CORE][3.0] Prevent BlockManager reregister when Executor is shutting down
sumeetgajjar commented on pull request #33770: URL: https://github.com/apache/spark/pull/33770#issuecomment-901560073 Thank you @dongjoon-hyun and @holdenk for taking a look at this PR.
[GitHub] [spark] sumeetgajjar commented on pull request #33771: [SPARK-35011][CORE][3.1] Avoid Block Manager registrations when StopExecutor msg is in-flight
sumeetgajjar commented on pull request #33771: URL: https://github.com/apache/spark/pull/33771#issuecomment-901559674 Thank you @dongjoon-hyun and @zhuqi-lucas for approving this PR.
[GitHub] [spark] sumeetgajjar commented on pull request #33782: [SPARK-35011][CORE][3.0] Avoid Block Manager registrations when StopExecutor msg is in-flight
sumeetgajjar commented on pull request #33782: URL: https://github.com/apache/spark/pull/33782#issuecomment-901559419 @dongjoon-hyun @mridulm @Ngone51 Could you please take a look at this backport PR?
[GitHub] [spark] sumeetgajjar opened a new pull request #33782: [SPARK-35011][CORE][3.0] Avoid Block Manager registrations when StopExecutor msg is in-flight
sumeetgajjar opened a new pull request #33782: URL: https://github.com/apache/spark/pull/33782 This PR backports #32114 to 3.0. ### What changes were proposed in this pull request? This patch proposes a fix to prevent triggering BlockManager re-registration while a `StopExecutor` msg is in flight. On receiving a `StopExecutor` msg, we do not remove the corresponding `BlockManagerInfo` from the `blockManagerInfo` map; instead we mark it as dead by updating the corresponding `executorRemovalTs`. A separate cleanup thread runs periodically to remove stale `BlockManagerInfo` entries from the `blockManagerInfo` map. Now if a recently removed `BlockManager` tries to register, the driver simply ignores it since the `blockManagerInfo` map already contains an entry for it. The same applies to `BlockManagerHeartbeat`: if the BlockManager belongs to a recently removed executor, the `blockManagerInfo` map would contain an entry and we shall not ask the corresponding `BlockManager` to re-register. ### Why are the changes needed? These changes are needed because BlockManager re-registration while an executor is shutting down causes inconsistent bookkeeping of executors in Spark. Consider the following scenario: - `CoarseGrainedSchedulerBackend` issues an async `StopExecutor` on the executorEndpoint - `CoarseGrainedSchedulerBackend` removes that executor from the Driver's internal data structures and publishes `SparkListenerExecutorRemoved` on the `listenerBus` - The Executor has still not processed `StopExecutor` from the Driver - The Driver receives a heartbeat from the Executor; since it cannot find the `executorId` in its data structures, it responds with `HeartbeatResponse(reregisterBlockManager = true)` - The `BlockManager` on the Executor re-registers with the `BlockManagerMaster` and `SparkListenerBlockManagerAdded` is published on the `listenerBus` - The Executor starts processing the `StopExecutor` and exits - `AppStatusListener` picks up the `SparkListenerBlockManagerAdded` event and updates `AppStatusStore` - `statusTracker.getExecutorInfos` refers to `AppStatusStore` to get the list of executors, which returns the dead executor as alive ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Modified the existing unit tests. - Ran a simple test application on minikube that asserts the number of executors is zero once the executor idle timeout is reached.
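The fix described above can be illustrated with a small sketch (hypothetical Python names, not the actual Scala code): the removed executor's entry is tombstoned with a removal timestamp instead of being deleted, so a late re-registration arriving while `StopExecutor` is in flight is ignored until a cleanup pass expires the stale entry.

```python
# Toy sketch of the tombstone-plus-cleanup scheme; names are hypothetical.
import time


class BlockManagerMaster:
    def __init__(self, expiry_seconds=60.0):
        self._expiry = expiry_seconds
        self._info = {}  # executor_id -> removal timestamp (None while alive)

    def register(self, executor_id):
        if executor_id in self._info:
            return False  # known (alive or recently removed): ignore
        self._info[executor_id] = None
        return True

    def stop_executor(self, executor_id):
        # Mark dead rather than delete, closing the re-register race.
        self._info[executor_id] = time.monotonic()

    def cleanup(self, now=None):
        # Periodic pass: drop only entries dead longer than the expiry.
        now = time.monotonic() if now is None else now
        self._info = {
            eid: ts for eid, ts in self._info.items()
            if ts is None or now - ts < self._expiry
        }


m = BlockManagerMaster(expiry_seconds=60.0)
assert m.register("exec-1") is True
m.stop_executor("exec-1")
# Late re-registration while StopExecutor is in flight is ignored.
assert m.register("exec-1") is False
m.cleanup(now=time.monotonic() + 120.0)
assert m.register("exec-1") is True  # stale entry expired; fresh register OK
```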
[GitHub] [spark] SparkQA commented on pull request #33599: [SPARK-36371][SQL] Support raw string literal
SparkQA commented on pull request #33599: URL: https://github.com/apache/spark/pull/33599#issuecomment-901558167 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47131/
[GitHub] [spark] AngersZhuuuu commented on pull request #30057: [SPARK-32838][SQL]Check DataSource insert command path with actual path
AngersZhuuuu commented on pull request #30057: URL: https://github.com/apache/spark/pull/30057#issuecomment-901556045 gentle ping @cloud-fan @viirya
[GitHub] [spark] SparkQA commented on pull request #33714: [SPARK-36399][PYTHON] Implement DataFrame.combine_first
SparkQA commented on pull request #33714: URL: https://github.com/apache/spark/pull/33714#issuecomment-901555483 Kubernetes integration test status failure URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47129/
[GitHub] [spark] SparkQA commented on pull request #33599: [SPARK-36371][SQL] Support raw string literal
SparkQA commented on pull request #33599: URL: https://github.com/apache/spark/pull/33599#issuecomment-901541894 **[Test build #142631 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142631/testReport)** for PR 33599 at commit [`ec963ef`](https://github.com/apache/spark/commit/ec963efa51cd02cf6816d4eebcf645c709e43f09).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #33599: [SPARK-36371][SQL] Support raw string literal
AmplabJenkins removed a comment on pull request #33599: URL: https://github.com/apache/spark/pull/33599#issuecomment-901276573
[GitHub] [spark] AmplabJenkins removed a comment on pull request #33650: [SPARK-36351][SQL] Refactor filter push down in file source v2
AmplabJenkins removed a comment on pull request #33650: URL: https://github.com/apache/spark/pull/33650#issuecomment-901539496
[GitHub] [spark] AmplabJenkins removed a comment on pull request #33714: [SPARK-36399][PYTHON] Implement DataFrame.combine_first
AmplabJenkins removed a comment on pull request #33714: URL: https://github.com/apache/spark/pull/33714#issuecomment-901539497
[GitHub] [spark] sarutak commented on pull request #33599: [SPARK-36371][SQL] Support raw string literal
sarutak commented on pull request #33599: URL: https://github.com/apache/spark/pull/33599#issuecomment-901539804 retest this please.
[GitHub] [spark] HyukjinKwon closed pull request #33332: [SPARK-36147][SQL] Warn if less files visible after stats write in BasicWriteStatsTracker
HyukjinKwon closed pull request #33332: URL: https://github.com/apache/spark/pull/33332
[GitHub] [spark] HyukjinKwon commented on pull request #33332: [SPARK-36147][SQL] Warn if less files visible after stats write in BasicWriteStatsTracker
HyukjinKwon commented on pull request #33332: URL: https://github.com/apache/spark/pull/33332#issuecomment-901539515 Merged to master.