Re: [PR] fix: Fallback to Spark when PARQUET_FIELD_ID_READ_ENABLED=true for new native scans [datafusion-comet]

2025-05-21 Thread via GitHub


andygrove commented on PR #1757:
URL: 
https://github.com/apache/datafusion-comet/pull/1757#issuecomment-2898149587

   I will close this since this will make all Spark SQL tests fall back to 
Spark. We seem to mostly support this feature in `native_iceberg_compat` 
already. I guess we have no choice but to add support in `native_datafusion`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: Fallback to Spark when PARQUET_FIELD_ID_READ_ENABLED=true for new native scans [datafusion-comet]

2025-05-21 Thread via GitHub


andygrove closed pull request #1757: fix: Fallback to Spark when 
PARQUET_FIELD_ID_READ_ENABLED=true for new native scans
URL: https://github.com/apache/datafusion-comet/pull/1757


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: Fallback to Spark when PARQUET_FIELD_ID_READ_ENABLED=true for new native scans [datafusion-comet]

2025-05-20 Thread via GitHub


comphead commented on code in PR #1757:
URL: https://github.com/apache/datafusion-comet/pull/1757#discussion_r2098944769


##
spark/src/main/scala/org/apache/comet/rules/CometScanRule.scala:
##
@@ -41,36 +41,44 @@ import org.apache.comet.parquet.{CometParquetScan, 
SupportsComet}
  * Spark physical optimizer rule for replacing Spark scans with Comet scans.
  */
 case class CometScanRule(session: SparkSession) extends Rule[SparkPlan] {
+
   override def apply(plan: SparkPlan): SparkPlan = {
-if (!isCometLoaded(conf) || !isCometScanEnabled(conf)) {
-  if (!isCometLoaded(conf)) {
-withInfo(plan, "Comet is not enabled")
-  } else if (!isCometScanEnabled(conf)) {
-withInfo(plan, "Comet Scan is not enabled")
-  }
-  plan
-} else {
-
-  def hasMetadataCol(plan: SparkPlan): Boolean = {
-plan.expressions.exists(_.exists {
-  case a: Attribute =>
-a.isMetadataCol
-  case _ => false
-})
-  }
-
-  plan.transform {
-case scan if hasMetadataCol(scan) =>
-  withInfo(scan, "Metadata column is not supported")
-
-// data source V1
-case scanExec: FileSourceScanExec =>
-  transformV1Scan(scanExec)
-
-// data source V2
-case scanExec: BatchScanExec =>
-  transformV2Scan(scanExec)
-  }
+if (!isCometLoaded(conf)) {
+  withInfo(plan, "Comet is not enabled")
+  return plan
+}
+
+if (!isCometScanEnabled(conf)) {
+  withInfo(plan, "Comet Scan is not enabled")
+  return plan
+}
+
+val scanImpl: String = COMET_NATIVE_SCAN_IMPL.get()
+if (SQLConf.get.getConf(
+SQLConf.PARQUET_FIELD_ID_READ_ENABLED) && scanImpl != 
CometConf.SCAN_NATIVE_COMET) {
+  withInfo(plan, s"Comet $scanImpl scan does not support 
PARQUET_FIELD_ID_READ_ENABLED")

Review Comment:
   ```suggestion
 withInfo(plan, s"Comet $scanImpl scan does not support with enabled 
`spark.sql.parquet.fieldId.read.enabled`")
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: Fallback to Spark when PARQUET_FIELD_ID_READ_ENABLED=true for new native scans [datafusion-comet]

2025-05-20 Thread via GitHub


parthchandra commented on PR #1757:
URL: 
https://github.com/apache/datafusion-comet/pull/1757#issuecomment-2895798125

   Ouch.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: Fallback to Spark when PARQUET_FIELD_ID_READ_ENABLED=true for new native scans [datafusion-comet]

2025-05-20 Thread via GitHub


andygrove commented on PR #1757:
URL: 
https://github.com/apache/datafusion-comet/pull/1757#issuecomment-2895355782

   `SQLConf.PARQUET_FIELD_ID_READ_ENABLED` is enabled in **all** Spark tests, 
so not sure what to do about this now.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: Fallback to Spark when PARQUET_FIELD_ID_READ_ENABLED=true for new native scans [datafusion-comet]

2025-05-20 Thread via GitHub


parthchandra commented on PR #1757:
URL: 
https://github.com/apache/datafusion-comet/pull/1757#issuecomment-2895301640

   Pre-emptively approving this. We can defer field_id support for native 
readers for the time being.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: Fallback to Spark when PARQUET_FIELD_ID_READ_ENABLED=true for new native scans [datafusion-comet]

2025-05-20 Thread via GitHub


codecov-commenter commented on PR #1757:
URL: 
https://github.com/apache/datafusion-comet/pull/1757#issuecomment-2895250157

   ## 
[Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1757?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 Report
   Attention: Patch coverage is `65.21739%` with `8 lines` in your changes 
missing coverage. Please review.
   > Project coverage is 57.21%. Comparing base 
[(`f09f8af`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/f09f8af64c6599255e116a376f4f008f2fd63b43?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 to head 
[(`4a84257`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/4a84257893d12a9f77ffc350ad42d2fd6b5d05ba?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
   > Report is 199 commits behind head on main.
   
   | [Files with missing 
lines](https://app.codecov.io/gh/apache/datafusion-comet/pull/1757?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 | Patch % | Lines |
   |---|---|---|
   | 
[...n/scala/org/apache/comet/rules/CometScanRule.scala](https://app.codecov.io/gh/apache/datafusion-comet/pull/1757?src=pr&el=tree&filepath=spark%2Fsrc%2Fmain%2Fscala%2Forg%2Fapache%2Fcomet%2Frules%2FCometScanRule.scala&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-c3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9jb21ldC9ydWxlcy9Db21ldFNjYW5SdWxlLnNjYWxh)
 | 65.21% | [3 Missing and 5 partials :warning: 
](https://app.codecov.io/gh/apache/datafusion-comet/pull/1757?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 |
   
   Additional details and impacted files
   
   
   ```diff
   @@ Coverage Diff  @@
   ##   main#1757  +/-   ##
   
   + Coverage 56.12%   57.21%   +1.08% 
   - Complexity  976 1070  +94 
   
 Files   119  130  +11 
 Lines 1174312603 +860 
 Branches   2251 2361 +110 
   
   + Hits   6591 7211 +620 
   - Misses 4012 4185 +173 
   - Partials   1140 1207  +67 
   ```
   
   
   
   [:umbrella: View full report in Codecov by 
Sentry](https://app.codecov.io/gh/apache/datafusion-comet/pull/1757?dropdown=coverage&src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
   
   :loudspeaker: Have feedback on the report? [Share it 
here](https://about.codecov.io/codecov-pr-comment-feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
   
:rocket: New features to boost your workflow: 
   
   - :snowflake: [Test 
Analytics](https://docs.codecov.com/docs/test-analytics): Detect flaky tests, 
report on failures, and find test suite problems.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: Fallback to Spark when PARQUET_FIELD_ID_READ_ENABLED=true for new native scans [datafusion-comet]

2025-05-20 Thread via GitHub


andygrove commented on PR #1757:
URL: 
https://github.com/apache/datafusion-comet/pull/1757#issuecomment-2895073250

   This may make too many tests fall back because Spark may be enabling this by 
default in all tests ... investigating


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: Fallback to Spark when PARQUET_FIELD_ID_READ_ENABLED=true for new native scans [datafusion-comet]

2025-05-20 Thread via GitHub


andygrove commented on code in PR #1757:
URL: https://github.com/apache/datafusion-comet/pull/1757#discussion_r2098328838


##
spark/src/main/scala/org/apache/comet/rules/CometScanRule.scala:
##
@@ -41,36 +41,44 @@ import org.apache.comet.parquet.{CometParquetScan, 
SupportsComet}
  * Spark physical optimizer rule for replacing Spark scans with Comet scans.
  */
 case class CometScanRule(session: SparkSession) extends Rule[SparkPlan] {
+  private val scanImpl: String = COMET_NATIVE_SCAN_IMPL.get()
+
   override def apply(plan: SparkPlan): SparkPlan = {
-if (!isCometLoaded(conf) || !isCometScanEnabled(conf)) {
-  if (!isCometLoaded(conf)) {
-withInfo(plan, "Comet is not enabled")
-  } else if (!isCometScanEnabled(conf)) {
-withInfo(plan, "Comet Scan is not enabled")
-  }
-  plan
-} else {
-
-  def hasMetadataCol(plan: SparkPlan): Boolean = {
-plan.expressions.exists(_.exists {
-  case a: Attribute =>
-a.isMetadataCol
-  case _ => false
-})
-  }
-
-  plan.transform {
-case scan if hasMetadataCol(scan) =>
-  withInfo(scan, "Metadata column is not supported")
-
-// data source V1
-case scanExec: FileSourceScanExec =>
-  transformV1Scan(scanExec)
-
-// data source V2
-case scanExec: BatchScanExec =>
-  transformV2Scan(scanExec)
-  }
+if (!isCometLoaded(conf)) {
+  withInfo(plan, "Comet is not enabled")
+  return plan
+}
+
+if (!isCometScanEnabled(conf)) {
+  withInfo(plan, "Comet Scan is not enabled")
+  return plan
+}
+
+if (SQLConf.get.getConf(
+SQLConf.PARQUET_FIELD_ID_READ_ENABLED) && scanImpl != 
CometConf.SCAN_NATIVE_COMET) {
+  withInfo(plan, s"Comet $scanImpl scan does not support 
PARQUET_FIELD_ID_READ_ENABLED")
+  return plan
+}

Review Comment:
   This is the change. The rest is refactoring.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org