viirya commented on a change in pull request #31413:
URL: https://github.com/apache/spark/pull/31413#discussion_r567639762



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala
##########
@@ -591,20 +590,34 @@ case class FileSourceScanExec(
     logInfo(s"Planning scan with bin packing, max size: $maxSplitBytes bytes, 
" +
       s"open cost is considered as scanning $openCostInBytes bytes.")
 
+    // Filter files with bucket pruning if possible
+    val filePruning: Path => Boolean = optionalBucketSet match {
+      case Some(bucketSet) =>
+        filePath => bucketSet.get(BucketingUtils.getBucketId(filePath.getName)
+          .getOrElse(sys.error(s"Invalid bucket file $filePath")))

Review comment:
       Your reasoning sounds fair, but this still looks like a potential breaking change. I'm not sure whether any users actually read tables this way, but it is certainly possible, and it would be very confusing to them: they never asked to read the table in a bucketed way, and it worked before.
   
   If we eventually still decide to fail the query, then an easy-to-understand error message is necessary, so users know why the bucket spec is applied here and why Spark fails to read the table. If possible, we should also give some hints on how to resolve it.
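
   For illustration only, here is a rough sketch (not part of this patch; the exception type, wording, and config name are just examples) of the kind of self-explaining error this comment asks for:

   ```scala
   // Hypothetical sketch: fail with a message that explains why the bucket spec
   // is applied and hints at a workaround (config name shown only as an example).
   val filePruning: Path => Boolean = optionalBucketSet match {
     case Some(bucketSet) =>
       filePath => bucketSet.get(BucketingUtils.getBucketId(filePath.getName).getOrElse(
         throw new IllegalStateException(
           s"Cannot read bucketed table: file $filePath does not follow the bucketed " +
             "file naming convention, but the table has a bucket spec in the metastore. " +
             "Repair or rename the data files, or disable bucketing (e.g. set " +
             "spark.sql.sources.bucketing.enabled=false) to read the table without it.")))
     case _ =>
       _ => true
   }
   ```

   The exact exception type and config do not matter much; the point is that the message says which convention was violated and how to work around it.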



