Re: [PR] feat: Add auto mode for `COMET_PARQUET_SCAN_IMPL` [datafusion-comet]
andygrove commented on PR #1747: URL: https://github.com/apache/datafusion-comet/pull/1747#issuecomment-2920369648 > Not sure why this would cause the ci failures that we see here. Maybe defer this until some more of the known issues are fixed? Some of the tests need updating because they make assumptions based on the configured default scan and have not been updated to handle the new "auto" option. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] feat: Add auto mode for `COMET_PARQUET_SCAN_IMPL` [datafusion-comet]
parthchandra commented on PR #1747: URL: https://github.com/apache/datafusion-comet/pull/1747#issuecomment-2899517869 Not sure why this would cause the ci failures that we see here. Maybe defer this until some more of the known issues are fixed? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] feat: Add auto mode for `COMET_PARQUET_SCAN_IMPL` [datafusion-comet]
parthchandra commented on code in PR #1747:
URL: https://github.com/apache/datafusion-comet/pull/1747#discussion_r2101363175
##
spark/src/main/scala/org/apache/comet/rules/CometScanRule.scala:
##
@@ -93,21 +93,63 @@ case class CometScanRule(session: SparkSession) extends
Rule[SparkPlan] {
return withInfos(scanExec, fallbackReasons.toSet)
}
-val scanImpl = COMET_NATIVE_SCAN_IMPL.get()
-if (scanImpl == CometConf.SCAN_NATIVE_DATAFUSION &&
!COMET_EXEC_ENABLED.get()) {
+var scanImpl = COMET_NATIVE_SCAN_IMPL.get()
+
+// if scan is auto then pick best available scan
+if (scanImpl == SCAN_AUTO) {
+ // TODO these checks are not yet exhaustive. For example,
native_datafusion does
+ // not support reading from object stores such as S3 yet
+
+ val typeChecker = CometScanTypeChecker(SCAN_NATIVE_ICEBERG_COMPAT)
+ val schemaSupported =
+typeChecker.isSchemaSupported(scanExec.requiredSchema,
fallbackReasons)
+ val partitionSchemaSupported =
+typeChecker.isSchemaSupported(r.partitionSchema, fallbackReasons)
+
+ // additional checks for known issues
+ def isComplexType(dt: DataType): Boolean = dt match {
+case _: StructType | _: ArrayType | _: MapType => true
+case _ => false
+ }
+
+ def hasKnownIssues(dataType: DataType): Boolean = {
+dataType match {
+ case s: StructType => s.exists(field =>
hasKnownIssues(field.dataType))
+ case a: ArrayType => hasKnownIssues(a.elementType)
+ case m: MapType => isComplexType(m.keyType) ||
isComplexType(m.valueType)
+ case _ => false
+}
+ }
+
+ val knownIssues =
+scanExec.requiredSchema.exists(field =>
hasKnownIssues(field.dataType)) ||
+ r.partitionSchema.exists(field => hasKnownIssues(field.dataType))
+
+ if (COMET_EXEC_ENABLED
+ .get() && schemaSupported && partitionSchemaSupported &&
+!scanExec.bucketedScan && !knownIssues) {
+scanImpl = SCAN_NATIVE_ICEBERG_COMPAT
Review Comment:
native_iceberg_compat should be able to handle bucketed scans
##
spark/src/main/scala/org/apache/comet/rules/CometScanRule.scala:
##
@@ -93,21 +93,63 @@ case class CometScanRule(session: SparkSession) extends
Rule[SparkPlan] {
return withInfos(scanExec, fallbackReasons.toSet)
}
-val scanImpl = COMET_NATIVE_SCAN_IMPL.get()
-if (scanImpl == CometConf.SCAN_NATIVE_DATAFUSION &&
!COMET_EXEC_ENABLED.get()) {
+var scanImpl = COMET_NATIVE_SCAN_IMPL.get()
+
+// if scan is auto then pick best available scan
+if (scanImpl == SCAN_AUTO) {
+ // TODO these checks are not yet exhaustive. For example,
native_datafusion does
+ // not support reading from object stores such as S3 yet
+
+ val typeChecker = CometScanTypeChecker(SCAN_NATIVE_ICEBERG_COMPAT)
+ val schemaSupported =
+typeChecker.isSchemaSupported(scanExec.requiredSchema,
fallbackReasons)
+ val partitionSchemaSupported =
+typeChecker.isSchemaSupported(r.partitionSchema, fallbackReasons)
+
+ // additional checks for known issues
+ def isComplexType(dt: DataType): Boolean = dt match {
+case _: StructType | _: ArrayType | _: MapType => true
+case _ => false
+ }
+
+ def hasKnownIssues(dataType: DataType): Boolean = {
+dataType match {
+ case s: StructType => s.exists(field =>
hasKnownIssues(field.dataType))
+ case a: ArrayType => hasKnownIssues(a.elementType)
+ case m: MapType => isComplexType(m.keyType) ||
isComplexType(m.valueType)
+ case _ => false
+}
+ }
+
+ val knownIssues =
+scanExec.requiredSchema.exists(field =>
hasKnownIssues(field.dataType)) ||
+ r.partitionSchema.exists(field => hasKnownIssues(field.dataType))
+
+ if (COMET_EXEC_ENABLED
+ .get() && schemaSupported && partitionSchemaSupported &&
+!scanExec.bucketedScan && !knownIssues) {
+scanImpl = SCAN_NATIVE_ICEBERG_COMPAT
+ }
+}
+
+if (scanImpl == SCAN_AUTO) {
+ scanImpl = SCAN_NATIVE_COMET
Review Comment:
We would never choose native_datafusion?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional
Re: [PR] feat: Add auto mode for `COMET_PARQUET_SCAN_IMPL` [datafusion-comet]
andygrove commented on PR #1747: URL: https://github.com/apache/datafusion-comet/pull/1747#issuecomment-2895481116 @parthchandra @mbutrovich This is ready for review now. I don't know if we want to keep in draft until more complete or merge and iterate. I also did not make auto the default yet. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] feat: Add auto mode for `COMET_PARQUET_SCAN_IMPL` [datafusion-comet]
andygrove commented on code in PR #1747: URL: https://github.com/apache/datafusion-comet/pull/1747#discussion_r2098652013 ## docs/source/user-guide/compatibility.md: ## @@ -29,12 +29,6 @@ Comet aims to provide consistent results with the version of Apache Spark that i This guide offers information about areas of functionality where there are known differences. -# Compatibility Guide - -Comet aims to provide consistent results with the version of Apache Spark that is being used. - -This guide offers information about areas of functionality where there are known differences. - Review Comment: This section appeared twice -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] feat: Add auto mode for `COMET_PARQUET_SCAN_IMPL` [datafusion-comet]
parthchandra commented on PR #1747: URL: https://github.com/apache/datafusion-comet/pull/1747#issuecomment-2891757995 Looking good so far. Will do a final review once it is ready. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] feat: Add auto mode for `COMET_PARQUET_SCAN_IMPL` [datafusion-comet]
codecov-commenter commented on PR #1747: URL: https://github.com/apache/datafusion-comet/pull/1747#issuecomment-2888529840 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1747?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) Report Attention: Patch coverage is `51.61290%` with `30 lines` in your changes missing coverage. Please review. > Project coverage is 57.20%. Comparing base [(`f09f8af`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/f09f8af64c6599255e116a376f4f008f2fd63b43?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) to head [(`64876cd`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/64876cd1a899ab1de9eef62415bc8e08cafa74c4?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache). > Report is 195 commits behind head on main. | [Files with missing lines](https://app.codecov.io/gh/apache/datafusion-comet/pull/1747?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | Patch % | Lines | |---|---|---| | [...n/scala/org/apache/comet/rules/CometScanRule.scala](https://app.codecov.io/gh/apache/datafusion-comet/pull/1747?src=pr&el=tree&filepath=spark%2Fsrc%2Fmain%2Fscala%2Forg%2Fapache%2Fcomet%2Frules%2FCometScanRule.scala&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-c3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9jb21ldC9ydWxlcy9Db21ldFNjYW5SdWxlLnNjYWxh) | 34.61% | [11 Missing and 6 partials :warning: ](https://app.codecov.io/gh/apache/datafusion-comet/pull/1747?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | | [.../scala/org/apache/comet/serde/QueryPlanSerde.scala](https://app.codecov.io/gh/apache/datafusion-comet/pull/1747?src=pr&el=tree&filepath=spark%2Fsrc%2Fmain%2Fscala%2Forg%2Fapache%2Fcomet%2Fserde%2FQueryPlanSerde.scala&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-c3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9jb21ldC9zZXJkZS9RdWVyeVBsYW5TZXJkZS5zY2FsYQ==) | 47.05% | [3 Missing and 6 partials :warning: ](https://app.codecov.io/gh/apache/datafusion-comet/pull/1747?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | | [...ala/org/apache/spark/sql/comet/CometScanExec.scala](https://app.codecov.io/gh/apache/datafusion-comet/pull/1747?src=pr&el=tree&filepath=spark%2Fsrc%2Fmain%2Fscala%2Forg%2Fapache%2Fspark%2Fsql%2Fcomet%2FCometScanExec.scala&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-c3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9zcGFyay9zcWwvY29tZXQvQ29tZXRTY2FuRXhlYy5zY2FsYQ==) | 75.00% | [0 Missing and 2 partials :warning: ](https://app.codecov.io/gh/apache/datafusion-comet/pull/1747?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | | [.../apache/comet/parquet/CometParquetFileFormat.scala](https://app.codecov.io/gh/apache/datafusion-comet/pull/1747?src=pr&el=tree&filepath=spark%2Fsrc%2Fmain%2Fscala%2Forg%2Fapache%2Fcomet%2Fparquet%2FCometParquetFileFormat.scala&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-c3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9jb21ldC9wYXJxdWV0L0NvbWV0UGFycXVldEZpbGVGb3JtYXQuc2NhbGE=) | 66.66% | [0 Missing and 1 partial :warning: ](https://app.codecov.io/gh/apache/datafusion-comet/pull/1747?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | | [...n/scala/org/apache/comet/rules/CometExecRule.scala](https://app.codecov.io/gh/apache/datafusion-comet/pull/1747?src=pr&el=tree&filepath=spark%2Fsrc%2Fmain%2Fscala%2Forg%2Fapache%2Fcomet%2Frules%2FCometExecRule.scala&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-c3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9jb21ldC9ydWxlcy9Db21ldEV4ZWNSdWxlLnNjYWxh) | 0.00% | [0 Missing and 1 partial :warning: ](https://app.codecov.io/gh/apache/datafusion-comet/pull/1747?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | Additional details and impacted files ```diff @@ Coverage Diff @@ ## main#1747 +/- ## + Coverage 56.12% 57.20% +1.08% - Complexity 976 1068 +92 Files 119 130 +11 Lines 11743
