Re: [PR] fix: Fallback to Spark when PARQUET_FIELD_ID_READ_ENABLED=true for new native scans [datafusion-comet]
andygrove commented on PR #1757: URL: https://github.com/apache/datafusion-comet/pull/1757#issuecomment-2898149587 I will close this since this will make all Spark SQL tests fall back to Spark. We seem to mostly support this feature in `native_iceberg_compat` already. I guess we have no choice but to add support in `native_datafusion`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: Fallback to Spark when PARQUET_FIELD_ID_READ_ENABLED=true for new native scans [datafusion-comet]
andygrove closed pull request #1757: fix: Fallback to Spark when PARQUET_FIELD_ID_READ_ENABLED=true for new native scans URL: https://github.com/apache/datafusion-comet/pull/1757 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: Fallback to Spark when PARQUET_FIELD_ID_READ_ENABLED=true for new native scans [datafusion-comet]
comphead commented on code in PR #1757: URL: https://github.com/apache/datafusion-comet/pull/1757#discussion_r2098944769 ## spark/src/main/scala/org/apache/comet/rules/CometScanRule.scala: ## @@ -41,36 +41,44 @@ import org.apache.comet.parquet.{CometParquetScan, SupportsComet} * Spark physical optimizer rule for replacing Spark scans with Comet scans. */ case class CometScanRule(session: SparkSession) extends Rule[SparkPlan] { + override def apply(plan: SparkPlan): SparkPlan = { -if (!isCometLoaded(conf) || !isCometScanEnabled(conf)) { - if (!isCometLoaded(conf)) { -withInfo(plan, "Comet is not enabled") - } else if (!isCometScanEnabled(conf)) { -withInfo(plan, "Comet Scan is not enabled") - } - plan -} else { - - def hasMetadataCol(plan: SparkPlan): Boolean = { -plan.expressions.exists(_.exists { - case a: Attribute => -a.isMetadataCol - case _ => false -}) - } - - plan.transform { -case scan if hasMetadataCol(scan) => - withInfo(scan, "Metadata column is not supported") - -// data source V1 -case scanExec: FileSourceScanExec => - transformV1Scan(scanExec) - -// data source V2 -case scanExec: BatchScanExec => - transformV2Scan(scanExec) - } +if (!isCometLoaded(conf)) { + withInfo(plan, "Comet is not enabled") + return plan +} + +if (!isCometScanEnabled(conf)) { + withInfo(plan, "Comet Scan is not enabled") + return plan +} + +val scanImpl: String = COMET_NATIVE_SCAN_IMPL.get() +if (SQLConf.get.getConf( +SQLConf.PARQUET_FIELD_ID_READ_ENABLED) && scanImpl != CometConf.SCAN_NATIVE_COMET) { + withInfo(plan, s"Comet $scanImpl scan does not support PARQUET_FIELD_ID_READ_ENABLED") Review Comment: ```suggestion withInfo(plan, s"Comet $scanImpl scan does not support with enabled `spark.sql.parquet.fieldId.read.enabled`") ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: Fallback to Spark when PARQUET_FIELD_ID_READ_ENABLED=true for new native scans [datafusion-comet]
parthchandra commented on PR #1757: URL: https://github.com/apache/datafusion-comet/pull/1757#issuecomment-2895798125 Ouch. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: Fallback to Spark when PARQUET_FIELD_ID_READ_ENABLED=true for new native scans [datafusion-comet]
andygrove commented on PR #1757: URL: https://github.com/apache/datafusion-comet/pull/1757#issuecomment-2895355782 `SQLConf.PARQUET_FIELD_ID_READ_ENABLED` is enabled in **all** Spark tests, so not sure what to do about this now. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: Fallback to Spark when PARQUET_FIELD_ID_READ_ENABLED=true for new native scans [datafusion-comet]
parthchandra commented on PR #1757: URL: https://github.com/apache/datafusion-comet/pull/1757#issuecomment-2895301640 Pre-emptively approving this. We can defer field_id support for native readers for the time being. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: Fallback to Spark when PARQUET_FIELD_ID_READ_ENABLED=true for new native scans [datafusion-comet]
codecov-commenter commented on PR #1757: URL: https://github.com/apache/datafusion-comet/pull/1757#issuecomment-2895250157 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1757?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) Report Attention: Patch coverage is `65.21739%` with `8 lines` in your changes missing coverage. Please review. > Project coverage is 57.21%. Comparing base [(`f09f8af`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/f09f8af64c6599255e116a376f4f008f2fd63b43?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) to head [(`4a84257`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/4a84257893d12a9f77ffc350ad42d2fd6b5d05ba?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache). > Report is 199 commits behind head on main. | [Files with missing lines](https://app.codecov.io/gh/apache/datafusion-comet/pull/1757?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | Patch % | Lines | |---|---|---| | [...n/scala/org/apache/comet/rules/CometScanRule.scala](https://app.codecov.io/gh/apache/datafusion-comet/pull/1757?src=pr&el=tree&filepath=spark%2Fsrc%2Fmain%2Fscala%2Forg%2Fapache%2Fcomet%2Frules%2FCometScanRule.scala&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-c3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9jb21ldC9ydWxlcy9Db21ldFNjYW5SdWxlLnNjYWxh) | 65.21% | [3 Missing and 5 partials :warning: ](https://app.codecov.io/gh/apache/datafusion-comet/pull/1757?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | Additional details and impacted files ```diff @@ Coverage Diff @@ ## main#1757 +/- ## + Coverage 56.12% 57.21% +1.08% - Complexity 976 1070 +94 Files 119 130 +11 Lines 1174312603 +860 Branches 2251 2361 +110 + Hits 6591 7211 +620 - Misses 4012 4185 +173 - Partials 1140 1207 +67 ``` [:umbrella: View full report in Codecov by Sentry](https://app.codecov.io/gh/apache/datafusion-comet/pull/1757?dropdown=coverage&src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache). :loudspeaker: Have feedback on the report? [Share it here](https://about.codecov.io/codecov-pr-comment-feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache). :rocket: New features to boost your workflow: - :snowflake: [Test Analytics](https://docs.codecov.com/docs/test-analytics): Detect flaky tests, report on failures, and find test suite problems. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: Fallback to Spark when PARQUET_FIELD_ID_READ_ENABLED=true for new native scans [datafusion-comet]
andygrove commented on PR #1757: URL: https://github.com/apache/datafusion-comet/pull/1757#issuecomment-2895073250 This may make too many tests fall back because Spark may be enabling this by default in all tests ... investigating -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: Fallback to Spark when PARQUET_FIELD_ID_READ_ENABLED=true for new native scans [datafusion-comet]
andygrove commented on code in PR #1757: URL: https://github.com/apache/datafusion-comet/pull/1757#discussion_r2098328838 ## spark/src/main/scala/org/apache/comet/rules/CometScanRule.scala: ## @@ -41,36 +41,44 @@ import org.apache.comet.parquet.{CometParquetScan, SupportsComet} * Spark physical optimizer rule for replacing Spark scans with Comet scans. */ case class CometScanRule(session: SparkSession) extends Rule[SparkPlan] { + private val scanImpl: String = COMET_NATIVE_SCAN_IMPL.get() + override def apply(plan: SparkPlan): SparkPlan = { -if (!isCometLoaded(conf) || !isCometScanEnabled(conf)) { - if (!isCometLoaded(conf)) { -withInfo(plan, "Comet is not enabled") - } else if (!isCometScanEnabled(conf)) { -withInfo(plan, "Comet Scan is not enabled") - } - plan -} else { - - def hasMetadataCol(plan: SparkPlan): Boolean = { -plan.expressions.exists(_.exists { - case a: Attribute => -a.isMetadataCol - case _ => false -}) - } - - plan.transform { -case scan if hasMetadataCol(scan) => - withInfo(scan, "Metadata column is not supported") - -// data source V1 -case scanExec: FileSourceScanExec => - transformV1Scan(scanExec) - -// data source V2 -case scanExec: BatchScanExec => - transformV2Scan(scanExec) - } +if (!isCometLoaded(conf)) { + withInfo(plan, "Comet is not enabled") + return plan +} + +if (!isCometScanEnabled(conf)) { + withInfo(plan, "Comet Scan is not enabled") + return plan +} + +if (SQLConf.get.getConf( +SQLConf.PARQUET_FIELD_ID_READ_ENABLED) && scanImpl != CometConf.SCAN_NATIVE_COMET) { + withInfo(plan, s"Comet $scanImpl scan does not support PARQUET_FIELD_ID_READ_ENABLED") + return plan +} Review Comment: This is the change. The rest is refactoring. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org