Re: [PR] feat: Add auto mode for `COMET_PARQUET_SCAN_IMPL` [datafusion-comet]

2025-05-29 Thread via GitHub


andygrove commented on PR #1747:
URL: 
https://github.com/apache/datafusion-comet/pull/1747#issuecomment-2920369648

   > Not sure why this would cause the ci failures that we see here. Maybe 
defer this until some more of the known issues are fixed?
   
   Some of the tests need updating because they make assumptions based on the 
configured default scan and have not been updated to handle the new "auto" 
option.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] feat: Add auto mode for `COMET_PARQUET_SCAN_IMPL` [datafusion-comet]

2025-05-21 Thread via GitHub


parthchandra commented on PR #1747:
URL: 
https://github.com/apache/datafusion-comet/pull/1747#issuecomment-2899517869

   Not sure why this would cause the ci failures that we see here. Maybe defer 
this until some more of the known issues are fixed?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] feat: Add auto mode for `COMET_PARQUET_SCAN_IMPL` [datafusion-comet]

2025-05-21 Thread via GitHub


parthchandra commented on code in PR #1747:
URL: https://github.com/apache/datafusion-comet/pull/1747#discussion_r2101363175


##
spark/src/main/scala/org/apache/comet/rules/CometScanRule.scala:
##
@@ -93,21 +93,63 @@ case class CometScanRule(session: SparkSession) extends 
Rule[SparkPlan] {
   return withInfos(scanExec, fallbackReasons.toSet)
 }
 
-val scanImpl = COMET_NATIVE_SCAN_IMPL.get()
-if (scanImpl == CometConf.SCAN_NATIVE_DATAFUSION && 
!COMET_EXEC_ENABLED.get()) {
+var scanImpl = COMET_NATIVE_SCAN_IMPL.get()
+
+// if scan is auto then pick best available scan
+if (scanImpl == SCAN_AUTO) {
+  // TODO these checks are not yet exhaustive. For example, 
native_datafusion does
+  //  not support reading from object stores such as S3 yet
+
+  val typeChecker = CometScanTypeChecker(SCAN_NATIVE_ICEBERG_COMPAT)
+  val schemaSupported =
+typeChecker.isSchemaSupported(scanExec.requiredSchema, 
fallbackReasons)
+  val partitionSchemaSupported =
+typeChecker.isSchemaSupported(r.partitionSchema, fallbackReasons)
+
+  // additional checks for known issues
+  def isComplexType(dt: DataType): Boolean = dt match {
+case _: StructType | _: ArrayType | _: MapType => true
+case _ => false
+  }
+
+  def hasKnownIssues(dataType: DataType): Boolean = {
+dataType match {
+  case s: StructType => s.exists(field => 
hasKnownIssues(field.dataType))
+  case a: ArrayType => hasKnownIssues(a.elementType)
+  case m: MapType => isComplexType(m.keyType) || 
isComplexType(m.valueType)
+  case _ => false
+}
+  }
+
+  val knownIssues =
+scanExec.requiredSchema.exists(field => 
hasKnownIssues(field.dataType)) ||
+  r.partitionSchema.exists(field => hasKnownIssues(field.dataType))
+
+  if (COMET_EXEC_ENABLED
+  .get() && schemaSupported && partitionSchemaSupported &&
+!scanExec.bucketedScan && !knownIssues) {
+scanImpl = SCAN_NATIVE_ICEBERG_COMPAT

Review Comment:
   native_iceberg_compat should be able to handle bucketed scans



##
spark/src/main/scala/org/apache/comet/rules/CometScanRule.scala:
##
@@ -93,21 +93,63 @@ case class CometScanRule(session: SparkSession) extends 
Rule[SparkPlan] {
   return withInfos(scanExec, fallbackReasons.toSet)
 }
 
-val scanImpl = COMET_NATIVE_SCAN_IMPL.get()
-if (scanImpl == CometConf.SCAN_NATIVE_DATAFUSION && 
!COMET_EXEC_ENABLED.get()) {
+var scanImpl = COMET_NATIVE_SCAN_IMPL.get()
+
+// if scan is auto then pick best available scan
+if (scanImpl == SCAN_AUTO) {
+  // TODO these checks are not yet exhaustive. For example, 
native_datafusion does
+  //  not support reading from object stores such as S3 yet
+
+  val typeChecker = CometScanTypeChecker(SCAN_NATIVE_ICEBERG_COMPAT)
+  val schemaSupported =
+typeChecker.isSchemaSupported(scanExec.requiredSchema, 
fallbackReasons)
+  val partitionSchemaSupported =
+typeChecker.isSchemaSupported(r.partitionSchema, fallbackReasons)
+
+  // additional checks for known issues
+  def isComplexType(dt: DataType): Boolean = dt match {
+case _: StructType | _: ArrayType | _: MapType => true
+case _ => false
+  }
+
+  def hasKnownIssues(dataType: DataType): Boolean = {
+dataType match {
+  case s: StructType => s.exists(field => 
hasKnownIssues(field.dataType))
+  case a: ArrayType => hasKnownIssues(a.elementType)
+  case m: MapType => isComplexType(m.keyType) || 
isComplexType(m.valueType)
+  case _ => false
+}
+  }
+
+  val knownIssues =
+scanExec.requiredSchema.exists(field => 
hasKnownIssues(field.dataType)) ||
+  r.partitionSchema.exists(field => hasKnownIssues(field.dataType))
+
+  if (COMET_EXEC_ENABLED
+  .get() && schemaSupported && partitionSchemaSupported &&
+!scanExec.bucketedScan && !knownIssues) {
+scanImpl = SCAN_NATIVE_ICEBERG_COMPAT
+  }
+}
+
+if (scanImpl == SCAN_AUTO) {
+  scanImpl = SCAN_NATIVE_COMET

Review Comment:
   We would never choose native_datafusion?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional

Re: [PR] feat: Add auto mode for `COMET_PARQUET_SCAN_IMPL` [datafusion-comet]

2025-05-20 Thread via GitHub


andygrove commented on PR #1747:
URL: 
https://github.com/apache/datafusion-comet/pull/1747#issuecomment-2895481116

   @parthchandra @mbutrovich This is ready for review now. I don't know if we 
want to keep in draft until more complete or merge and iterate. I also did not 
make auto the default yet.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] feat: Add auto mode for `COMET_PARQUET_SCAN_IMPL` [datafusion-comet]

2025-05-20 Thread via GitHub


andygrove commented on code in PR #1747:
URL: https://github.com/apache/datafusion-comet/pull/1747#discussion_r2098652013


##
docs/source/user-guide/compatibility.md:
##
@@ -29,12 +29,6 @@ Comet aims to provide consistent results with the version of 
Apache Spark that i
 
 This guide offers information about areas of functionality where there are 
known differences.
 
-# Compatibility Guide
-
-Comet aims to provide consistent results with the version of Apache Spark that 
is being used.
-
-This guide offers information about areas of functionality where there are 
known differences.
-

Review Comment:
   This section appeared twice



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] feat: Add auto mode for `COMET_PARQUET_SCAN_IMPL` [datafusion-comet]

2025-05-19 Thread via GitHub


parthchandra commented on PR #1747:
URL: 
https://github.com/apache/datafusion-comet/pull/1747#issuecomment-2891757995

   Looking good so far. Will do a final review once it is ready.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] feat: Add auto mode for `COMET_PARQUET_SCAN_IMPL` [datafusion-comet]

2025-05-17 Thread via GitHub


codecov-commenter commented on PR #1747:
URL: 
https://github.com/apache/datafusion-comet/pull/1747#issuecomment-2888529840

   ## 
[Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1747?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 Report
   Attention: Patch coverage is `51.61290%` with `30 lines` in your changes 
missing coverage. Please review.
   > Project coverage is 57.20%. Comparing base 
[(`f09f8af`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/f09f8af64c6599255e116a376f4f008f2fd63b43?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 to head 
[(`64876cd`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/64876cd1a899ab1de9eef62415bc8e08cafa74c4?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
   > Report is 195 commits behind head on main.
   
   | [Files with missing 
lines](https://app.codecov.io/gh/apache/datafusion-comet/pull/1747?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 | Patch % | Lines |
   |---|---|---|
   | 
[...n/scala/org/apache/comet/rules/CometScanRule.scala](https://app.codecov.io/gh/apache/datafusion-comet/pull/1747?src=pr&el=tree&filepath=spark%2Fsrc%2Fmain%2Fscala%2Forg%2Fapache%2Fcomet%2Frules%2FCometScanRule.scala&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-c3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9jb21ldC9ydWxlcy9Db21ldFNjYW5SdWxlLnNjYWxh)
 | 34.61% | [11 Missing and 6 partials :warning: 
](https://app.codecov.io/gh/apache/datafusion-comet/pull/1747?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 |
   | 
[.../scala/org/apache/comet/serde/QueryPlanSerde.scala](https://app.codecov.io/gh/apache/datafusion-comet/pull/1747?src=pr&el=tree&filepath=spark%2Fsrc%2Fmain%2Fscala%2Forg%2Fapache%2Fcomet%2Fserde%2FQueryPlanSerde.scala&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-c3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9jb21ldC9zZXJkZS9RdWVyeVBsYW5TZXJkZS5zY2FsYQ==)
 | 47.05% | [3 Missing and 6 partials :warning: 
](https://app.codecov.io/gh/apache/datafusion-comet/pull/1747?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 |
   | 
[...ala/org/apache/spark/sql/comet/CometScanExec.scala](https://app.codecov.io/gh/apache/datafusion-comet/pull/1747?src=pr&el=tree&filepath=spark%2Fsrc%2Fmain%2Fscala%2Forg%2Fapache%2Fspark%2Fsql%2Fcomet%2FCometScanExec.scala&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-c3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9zcGFyay9zcWwvY29tZXQvQ29tZXRTY2FuRXhlYy5zY2FsYQ==)
 | 75.00% | [0 Missing and 2 partials :warning: 
](https://app.codecov.io/gh/apache/datafusion-comet/pull/1747?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 |
   | 
[.../apache/comet/parquet/CometParquetFileFormat.scala](https://app.codecov.io/gh/apache/datafusion-comet/pull/1747?src=pr&el=tree&filepath=spark%2Fsrc%2Fmain%2Fscala%2Forg%2Fapache%2Fcomet%2Fparquet%2FCometParquetFileFormat.scala&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-c3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9jb21ldC9wYXJxdWV0L0NvbWV0UGFycXVldEZpbGVGb3JtYXQuc2NhbGE=)
 | 66.66% | [0 Missing and 1 partial :warning: 
](https://app.codecov.io/gh/apache/datafusion-comet/pull/1747?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 |
   | 
[...n/scala/org/apache/comet/rules/CometExecRule.scala](https://app.codecov.io/gh/apache/datafusion-comet/pull/1747?src=pr&el=tree&filepath=spark%2Fsrc%2Fmain%2Fscala%2Forg%2Fapache%2Fcomet%2Frules%2FCometExecRule.scala&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-c3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9jb21ldC9ydWxlcy9Db21ldEV4ZWNSdWxlLnNjYWxh)
 | 0.00% | [0 Missing and 1 partial :warning: 
](https://app.codecov.io/gh/apache/datafusion-comet/pull/1747?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 |
   
   Additional details and impacted files
   
   
   ```diff
   @@ Coverage Diff  @@
   ##   main#1747  +/-   ##
   
   + Coverage 56.12%   57.20%   +1.08% 
   - Complexity  976 1068  +92 
   
 Files   119  130  +11 
 Lines 11743