Re: [PR] feat: `auto` scan mode should check for supported file location [datafusion-comet]

2025-06-25 Thread via GitHub


parthchandra commented on code in PR #1930:
URL: https://github.com/apache/datafusion-comet/pull/1930#discussion_r2167875652


##
dev/diffs/3.4.3.diff:
##
@@ -2868,6 +2868,28 @@ index 52abd248f3a..7a199931a08 100644
case h: HiveTableScanExec => h.partitionPruningPred.collect {
  case d: DynamicPruningExpression => d.child
}
+diff --git a/sql/hive/src/test/scala/org/apache/spark/sql/hive/PartitionedTablePerfStatsSuite.scala b/sql/hive/src/test/scala/org/apache/spark/sql/hive/PartitionedTablePerfStatsSuite.scala
+index de3b1ffccf0..2a76d127093 100644
+--- a/sql/hive/src/test/scala/org/apache/spark/sql/hive/PartitionedTablePerfStatsSuite.scala
++++ b/sql/hive/src/test/scala/org/apache/spark/sql/hive/PartitionedTablePerfStatsSuite.scala
+@@ -23,14 +23,15 @@ import java.util.concurrent.{Executors, TimeUnit}
+ import org.scalatest.BeforeAndAfterEach
+ 
+ import org.apache.spark.metrics.source.HiveCatalogMetrics
+-import org.apache.spark.sql.QueryTest
++import org.apache.spark.sql.{IgnoreCometSuite, QueryTest}
+ import org.apache.spark.sql.execution.datasources.FileStatusCache
+ import org.apache.spark.sql.hive.test.TestHiveSingleton
+ import org.apache.spark.sql.internal.SQLConf
+ import org.apache.spark.sql.test.SQLTestUtils
+ 
+ class PartitionedTablePerfStatsSuite
+-  extends QueryTest with TestHiveSingleton with SQLTestUtils with BeforeAndAfterEach {
++  extends QueryTest with TestHiveSingleton with SQLTestUtils with BeforeAndAfterEach
++    with IgnoreCometSuite {

Review Comment:
   > so is no longer compatible with Comet
   
   Why should that be? The file metadata is read before Comet kicks in. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] feat: `auto` scan mode should check for supported file location [datafusion-comet]

2025-06-25 Thread via GitHub


andygrove commented on code in PR #1930:
URL: https://github.com/apache/datafusion-comet/pull/1930#discussion_r2167849940


##
dev/diffs/3.4.3.diff:
##
@@ -2868,6 +2868,28 @@ index 52abd248f3a..7a199931a08 100644
case h: HiveTableScanExec => h.partitionPruningPred.collect {
  case d: DynamicPruningExpression => d.child
}
+diff --git a/sql/hive/src/test/scala/org/apache/spark/sql/hive/PartitionedTablePerfStatsSuite.scala b/sql/hive/src/test/scala/org/apache/spark/sql/hive/PartitionedTablePerfStatsSuite.scala
+index de3b1ffccf0..2a76d127093 100644
+--- a/sql/hive/src/test/scala/org/apache/spark/sql/hive/PartitionedTablePerfStatsSuite.scala
++++ b/sql/hive/src/test/scala/org/apache/spark/sql/hive/PartitionedTablePerfStatsSuite.scala
+@@ -23,14 +23,15 @@ import java.util.concurrent.{Executors, TimeUnit}
+ import org.scalatest.BeforeAndAfterEach
+ 
+ import org.apache.spark.metrics.source.HiveCatalogMetrics
+-import org.apache.spark.sql.QueryTest
++import org.apache.spark.sql.{IgnoreCometSuite, QueryTest}
+ import org.apache.spark.sql.execution.datasources.FileStatusCache
+ import org.apache.spark.sql.hive.test.TestHiveSingleton
+ import org.apache.spark.sql.internal.SQLConf
+ import org.apache.spark.sql.test.SQLTestUtils
+ 
+ class PartitionedTablePerfStatsSuite
+-  extends QueryTest with TestHiveSingleton with SQLTestUtils with BeforeAndAfterEach {
++  extends QueryTest with TestHiveSingleton with SQLTestUtils with BeforeAndAfterEach
++    with IgnoreCometSuite {

Review Comment:
   We ignore this suite because it uses stats to measure access to file 
metadata, so it is no longer compatible with Comet. The test is not testing 
query execution.
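
   For context, `IgnoreCometSuite` works as a marker trait: a suite mixes it in, and the patched test harness skips the whole suite when Comet is enabled. The sketch below is a hypothetical stand-in (the trait name is real, but its definition and the `SuiteFilter` helper here are illustrative; the real trait lives in the patched Spark test sources):

   ```scala
   // Illustrative stand-in for the marker-trait pattern behind IgnoreCometSuite.
   // Mixing in an empty trait tags a suite; the harness checks for the tag.
   trait IgnoreCometSuite

   class PerfStatsLikeSuite extends IgnoreCometSuite

   object SuiteFilter {
     // Skip any suite carrying the marker when Comet is active.
     def shouldRun(suite: AnyRef, cometEnabled: Boolean): Boolean =
       !(cometEnabled && suite.isInstanceOf[IgnoreCometSuite])
   }
   ```

   The empty-trait approach keeps the exclusion declarative: the diff only touches the suite's `extends` clause, not its test bodies.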






Re: [PR] feat: `auto` scan mode should check for supported file location [datafusion-comet]

2025-06-25 Thread via GitHub


parthchandra commented on code in PR #1930:
URL: https://github.com/apache/datafusion-comet/pull/1930#discussion_r2167792716


##
spark/src/main/scala/org/apache/comet/rules/CometScanRule.scala:
##
@@ -258,11 +258,15 @@ case class CometScanRule(session: SparkSession) extends Rule[SparkPlan] {
   }
 
   private def selectScan(scanExec: FileSourceScanExec, partitionSchema: StructType): String = {
-    // TODO these checks are not yet exhaustive. For example, native_iceberg_compat does
-    //  not support reading from S3
 
     val fallbackReasons = new ListBuffer[String]()
 
+    // native_iceberg_compat only supports local filesystem and S3
+    if (!scanExec.relation.inputFiles
+        .forall(path => path.startsWith("file://") || path.startsWith("s3a://"))) {

Review Comment:
   This is the only way to get the file names, as far as I know.
   I don't think this adds much overhead. Also, the file names have to come 
from either an InMemoryFileIndex (built by scanning a path) or from a table 
definition, which must come from a catalog. So reading the catalog is 
unavoidable.
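
   The check under discussion can be isolated into a small stand-alone sketch (the object name and method signatures here are illustrative; in the PR the logic reads `scanExec.relation.inputFiles` inside `CometScanRule.selectScan` and appends to `fallbackReasons` on failure):

   ```scala
   // Sketch of the path-scheme check from the diff above: every input file
   // must live on a filesystem that native_iceberg_compat can read.
   object ScanLocationCheck {
     // Per the PR, native_iceberg_compat only supports local files and S3 (s3a).
     private val supportedPrefixes = Seq("file://", "s3a://")

     def allFilesSupported(inputFiles: Seq[String]): Boolean =
       inputFiles.forall(path => supportedPrefixes.exists(p => path.startsWith(p)))
   }
   ```

   Note that `forall` short-circuits on the first unsupported path, so the cost is bounded by the file listing itself, which is the overhead being discussed in this thread.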






Re: [PR] feat: `auto` scan mode should check for supported file location [datafusion-comet]

2025-06-25 Thread via GitHub


andygrove commented on code in PR #1930:
URL: https://github.com/apache/datafusion-comet/pull/1930#discussion_r2167730097


##
spark/src/main/scala/org/apache/comet/rules/CometScanRule.scala:
##
@@ -258,11 +258,15 @@ case class CometScanRule(session: SparkSession) extends Rule[SparkPlan] {
   }
 
   private def selectScan(scanExec: FileSourceScanExec, partitionSchema: StructType): String = {
-    // TODO these checks are not yet exhaustive. For example, native_iceberg_compat does
-    //  not support reading from S3
 
     val fallbackReasons = new ListBuffer[String]()
 
+    // native_iceberg_compat only supports local filesystem and S3
+    if (!scanExec.relation.inputFiles
+        .forall(path => path.startsWith("file://") || path.startsWith("s3a://"))) {

Review Comment:
   @parthchandra I may need some guidance here. Just looking at the input files 
causes some stats to be updated (such as 
`HiveCatalogMetrics.METRIC_FILES_DISCOVERED` and 
`HiveCatalogMetrics.METRIC_FILE_CACHE_HITS`), leading to some test failures. I 
wonder if this could be adding some overhead.
   
   Do you know if there is a more efficient way for us to check for supported 
file locations?






Re: [PR] feat: `auto` scan mode should check for supported file location [datafusion-comet]

2025-06-25 Thread via GitHub


codecov-commenter commented on PR #1930:
URL: 
https://github.com/apache/datafusion-comet/pull/1930#issuecomment-3005204578

   ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1930?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) Report
   Attention: Patch coverage is `0%` with `3 lines` in your changes missing coverage. Please review.
   > Project coverage is 33.25%. Comparing base [(`f09f8af`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/f09f8af64c6599255e116a376f4f008f2fd63b43?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) to head [(`4050677`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/405067719358766f462ef51a460d15f1ff182b37?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
   > Report is 285 commits behind head on main.
   
   | [Files with missing lines](https://app.codecov.io/gh/apache/datafusion-comet/pull/1930?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | Patch % | Lines |
   |---|---|---|
   | [...n/scala/org/apache/comet/rules/CometScanRule.scala](https://app.codecov.io/gh/apache/datafusion-comet/pull/1930?src=pr&el=tree&filepath=spark%2Fsrc%2Fmain%2Fscala%2Forg%2Fapache%2Fcomet%2Frules%2FCometScanRule.scala&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-c3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9jb21ldC9ydWxlcy9Db21ldFNjYW5SdWxlLnNjYWxh) | 0.00% | [3 Missing :warning:](https://app.codecov.io/gh/apache/datafusion-comet/pull/1930?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) |
   
   Additional details and impacted files
   
   
   ```diff
   @@             Coverage Diff              @@
   ##             main    #1930       +/-   ##
   ============================================
   - Coverage     56.12%   33.25%   -22.88%
   + Complexity      976      800      -176
   ============================================
     Files           119      130       +11
     Lines         11743    12771     +1028
     Branches       2251     2384      +133
   ============================================
   - Hits           6591     4247     -2344
   - Misses         4012     7577     +3565
   + Partials       1140      947      -193
   ```
   
   
   
   [:umbrella: View full report in Codecov by Sentry](https://app.codecov.io/gh/apache/datafusion-comet/pull/1930?dropdown=coverage&src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
   
   :loudspeaker: Have feedback on the report? [Share it here](https://about.codecov.io/codecov-pr-comment-feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
   
   :rocket: New features to boost your workflow:
   
   - :snowflake: [Test Analytics](https://docs.codecov.com/docs/test-analytics): Detect flaky tests, report on failures, and find test suite problems.
   

