Re: [PR] feat: `auto` scan mode should check for supported file location [datafusion-comet]
parthchandra commented on code in PR #1930:
URL: https://github.com/apache/datafusion-comet/pull/1930#discussion_r2167875652
##
dev/diffs/3.4.3.diff:
##
@@ -2868,6 +2868,28 @@ index 52abd248f3a..7a199931a08 100644
         case h: HiveTableScanExec => h.partitionPruningPred.collect {
           case d: DynamicPruningExpression => d.child
         }
+diff --git a/sql/hive/src/test/scala/org/apache/spark/sql/hive/PartitionedTablePerfStatsSuite.scala b/sql/hive/src/test/scala/org/apache/spark/sql/hive/PartitionedTablePerfStatsSuite.scala
+index de3b1ffccf0..2a76d127093 100644
+--- a/sql/hive/src/test/scala/org/apache/spark/sql/hive/PartitionedTablePerfStatsSuite.scala
++++ b/sql/hive/src/test/scala/org/apache/spark/sql/hive/PartitionedTablePerfStatsSuite.scala
+@@ -23,14 +23,15 @@ import java.util.concurrent.{Executors, TimeUnit}
+ import org.scalatest.BeforeAndAfterEach
+
+ import org.apache.spark.metrics.source.HiveCatalogMetrics
+-import org.apache.spark.sql.QueryTest
++import org.apache.spark.sql.{IgnoreCometSuite, QueryTest}
+ import org.apache.spark.sql.execution.datasources.FileStatusCache
+ import org.apache.spark.sql.hive.test.TestHiveSingleton
+ import org.apache.spark.sql.internal.SQLConf
+ import org.apache.spark.sql.test.SQLTestUtils
+
+ class PartitionedTablePerfStatsSuite
+-  extends QueryTest with TestHiveSingleton with SQLTestUtils with BeforeAndAfterEach {
++  extends QueryTest with TestHiveSingleton with SQLTestUtils with BeforeAndAfterEach
++    with IgnoreCometSuite {
Review Comment:
> so is no longer compatible with Comet
Why should that be? The file metadata is read before Comet kicks in.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] feat: `auto` scan mode should check for supported file location [datafusion-comet]
andygrove commented on code in PR #1930:
URL: https://github.com/apache/datafusion-comet/pull/1930#discussion_r2167849940
##
dev/diffs/3.4.3.diff:
##
@@ -2868,6 +2868,28 @@ index 52abd248f3a..7a199931a08 100644
         case h: HiveTableScanExec => h.partitionPruningPred.collect {
           case d: DynamicPruningExpression => d.child
         }
+diff --git a/sql/hive/src/test/scala/org/apache/spark/sql/hive/PartitionedTablePerfStatsSuite.scala b/sql/hive/src/test/scala/org/apache/spark/sql/hive/PartitionedTablePerfStatsSuite.scala
+index de3b1ffccf0..2a76d127093 100644
+--- a/sql/hive/src/test/scala/org/apache/spark/sql/hive/PartitionedTablePerfStatsSuite.scala
++++ b/sql/hive/src/test/scala/org/apache/spark/sql/hive/PartitionedTablePerfStatsSuite.scala
+@@ -23,14 +23,15 @@ import java.util.concurrent.{Executors, TimeUnit}
+ import org.scalatest.BeforeAndAfterEach
+
+ import org.apache.spark.metrics.source.HiveCatalogMetrics
+-import org.apache.spark.sql.QueryTest
++import org.apache.spark.sql.{IgnoreCometSuite, QueryTest}
+ import org.apache.spark.sql.execution.datasources.FileStatusCache
+ import org.apache.spark.sql.hive.test.TestHiveSingleton
+ import org.apache.spark.sql.internal.SQLConf
+ import org.apache.spark.sql.test.SQLTestUtils
+
+ class PartitionedTablePerfStatsSuite
+-  extends QueryTest with TestHiveSingleton with SQLTestUtils with BeforeAndAfterEach {
++  extends QueryTest with TestHiveSingleton with SQLTestUtils with BeforeAndAfterEach
++    with IgnoreCometSuite {
Review Comment:
We ignore this suite because it is using stats to measure access to file
metadata, so is no longer compatible with Comet. The test is not testing query
execution.
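(Editorial note: `IgnoreCometSuite` appears to act as a marker trait that the test harness checks in order to skip a suite when Comet is enabled. The sketch below illustrates that general marker-interface pattern in plain Java; `MarkerTraitDemo`, `shouldRun`, and the nested classes are hypothetical names for illustration, not code from Spark or Comet.)

```java
// Sketch of the marker-interface pattern presumably behind IgnoreCometSuite.
// Everything here is illustrative; the real harness lives in Spark's test code.
public class MarkerTraitDemo {
    interface IgnoreCometSuite {}

    static class PartitionedTablePerfStatsSuite implements IgnoreCometSuite {}
    static class OtherSuite {}

    // Skip a suite only when Comet is enabled AND the suite opted out.
    static boolean shouldRun(Class<?> suite, boolean cometEnabled) {
        return !(cometEnabled && IgnoreCometSuite.class.isAssignableFrom(suite));
    }

    public static void main(String[] args) {
        System.out.println(shouldRun(PartitionedTablePerfStatsSuite.class, true));  // false: skipped
        System.out.println(shouldRun(PartitionedTablePerfStatsSuite.class, false)); // true
        System.out.println(shouldRun(OtherSuite.class, true));                      // true
    }
}
```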
Re: [PR] feat: `auto` scan mode should check for supported file location [datafusion-comet]
parthchandra commented on code in PR #1930:
URL: https://github.com/apache/datafusion-comet/pull/1930#discussion_r2167792716
##
spark/src/main/scala/org/apache/comet/rules/CometScanRule.scala:
##
@@ -258,11 +258,15 @@ case class CometScanRule(session: SparkSession) extends Rule[SparkPlan] {
   }

   private def selectScan(scanExec: FileSourceScanExec, partitionSchema: StructType): String = {
-    // TODO these checks are not yet exhaustive. For example, native_iceberg_compat does
-    // not support reading from S3
     val fallbackReasons = new ListBuffer[String]()
+    // native_iceberg_compat only supports local filesystem and S3
+    if (!scanExec.relation.inputFiles
+        .forall(path => path.startsWith("file://") || path.startsWith("s3a://"))) {
Review Comment:
This is the only way to get the file names, as far as I know.
I don't think this adds much overhead. Also, the file names have to come
either from an InMemoryFileIndex (built by scanning a path) or from a table
definition, which must come from a catalog. So reading the catalog is
unavoidable.
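(Editorial note: the prefix test quoted in the hunk above can be sketched standalone. `SchemeCheck` and `allSupported` are hypothetical names, and parsing the URI scheme is one possible variant of the literal `startsWith` check; it additionally accepts equivalent forms such as `file:/path`, which have no authority component and so would fail a `startsWith("file://")` test.)

```java
import java.net.URI;
import java.util.List;

// Standalone sketch of the file-location check under discussion; not the
// PR's actual code. Compares the parsed URI scheme against the schemes
// that native_iceberg_compat is said to support (local filesystem and S3).
public class SchemeCheck {
    static boolean allSupported(List<String> inputFiles) {
        return inputFiles.stream().allMatch(path -> {
            String scheme = URI.create(path).getScheme();
            return "file".equals(scheme) || "s3a".equals(scheme);
        });
    }

    public static void main(String[] args) {
        System.out.println(allSupported(
            List.of("file:/tmp/t/part-0.parquet", "s3a://bucket/t/part-1.parquet"))); // true
        System.out.println(allSupported(
            List.of("hdfs://namenode:8020/warehouse/t/part-0.parquet"))); // false
    }
}
```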
Re: [PR] feat: `auto` scan mode should check for supported file location [datafusion-comet]
andygrove commented on code in PR #1930:
URL: https://github.com/apache/datafusion-comet/pull/1930#discussion_r2167730097
##
spark/src/main/scala/org/apache/comet/rules/CometScanRule.scala:
##
@@ -258,11 +258,15 @@ case class CometScanRule(session: SparkSession) extends Rule[SparkPlan] {
   }

   private def selectScan(scanExec: FileSourceScanExec, partitionSchema: StructType): String = {
-    // TODO these checks are not yet exhaustive. For example, native_iceberg_compat does
-    // not support reading from S3
     val fallbackReasons = new ListBuffer[String]()
+    // native_iceberg_compat only supports local filesystem and S3
+    if (!scanExec.relation.inputFiles
+        .forall(path => path.startsWith("file://") || path.startsWith("s3a://"))) {
Review Comment:
@parthchandra I may need some guidance here. Just looking at the input files
causes some stats to be updated (such as
`HiveCatalogMetrics.METRIC_FILES_DISCOVERED` and
`HiveCatalogMetrics.METRIC_FILE_CACHE_HITS`), leading to some test failures. I
wonder if this could be adding some overhead.
Do you know if there is a more efficient way for us to check for supported
file locations?
Re: [PR] feat: `auto` scan mode should check for supported file location [datafusion-comet]
codecov-commenter commented on PR #1930:
URL: https://github.com/apache/datafusion-comet/pull/1930#issuecomment-3005204578

## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1930?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) Report

Attention: Patch coverage is `0%` with `3 lines` in your changes missing coverage. Please review.

> Project coverage is 33.25%. Comparing base [(`f09f8af`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/f09f8af64c6599255e116a376f4f008f2fd63b43?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) to head [(`4050677`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/405067719358766f462ef51a460d15f1ff182b37?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
> Report is 285 commits behind head on main.

| [Files with missing lines](https://app.codecov.io/gh/apache/datafusion-comet/pull/1930?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | Patch % | Lines |
|---|---|---|
| [...n/scala/org/apache/comet/rules/CometScanRule.scala](https://app.codecov.io/gh/apache/datafusion-comet/pull/1930?src=pr&el=tree&filepath=spark%2Fsrc%2Fmain%2Fscala%2Forg%2Fapache%2Fcomet%2Frules%2FCometScanRule.scala&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-c3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9jb21ldC9ydWxlcy9Db21ldFNjYW5SdWxlLnNjYWxh) | 0.00% | [3 Missing :warning:](https://app.codecov.io/gh/apache/datafusion-comet/pull/1930?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) |

Additional details and impacted files:

```diff
@@             Coverage Diff              @@
##             main    #1930       +/-   ##
============================================
- Coverage    56.12%   33.25%    -22.88%
+ Complexity     976      800       -176
============================================
  Files          119      130        +11
  Lines        11743    12771      +1028
  Branches      2251     2384       +133
============================================
- Hits          6591     4247      -2344
- Misses        4012     7577      +3565
+ Partials      1140      947       -193
```

[:umbrella: View full report in Codecov by Sentry](https://app.codecov.io/gh/apache/datafusion-comet/pull/1930?dropdown=coverage&src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
