viirya commented on a change in pull request #31413: URL: https://github.com/apache/spark/pull/31413#discussion_r567639762
########## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala ##########

@@ -591,20 +590,34 @@ case class FileSourceScanExec(
     logInfo(s"Planning scan with bin packing, max size: $maxSplitBytes bytes, " +
       s"open cost is considered as scanning $openCostInBytes bytes.")

+    // Filter files with bucket pruning if possible
+    val filePruning: Path => Boolean = optionalBucketSet match {
+      case Some(bucketSet) =>
+        filePath => bucketSet.get(BucketingUtils.getBucketId(filePath.getName)
+          .getOrElse(sys.error(s"Invalid bucket file $filePath")))

Review comment:
   Your reasoning sounds reasonable, but this still looks like a potential breaking change. I'm not sure whether any users read tables this way, but it is certainly possible, and it would be very confusing to them: they never asked to read the table in a bucketed way, and it worked before.
   If we eventually decide to fail the query anyway, then an easy-to-understand error message is necessary so users know why Spark applies the bucket spec here and why it fails to read the table. If possible, we should also provide hints for resolving the problem.
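For context on why an invalid file name fails the query here: the pruning predicate in the diff parses the bucket id out of each file name via `BucketingUtils.getBucketId` and checks it against the pruned bucket set. Below is a minimal, self-contained sketch of that logic; the object name, regex, and helper signatures are illustrative assumptions for this comment, not Spark's exact internals.

```scala
import scala.collection.immutable.BitSet

object BucketPruningSketch {
  // Bucketed file names look like "part-00000_00002.c000.snappy.parquet";
  // the digits after the last '_' (before the extensions) encode the bucket id.
  private val BucketedFileName = """.*_(\d+)(?:\..*)?$""".r

  // Sketch of BucketingUtils.getBucketId: parse the bucket id, if any.
  def getBucketId(fileName: String): Option[Int] = fileName match {
    case BucketedFileName(id) => Some(id.toInt)
    case _                    => None
  }

  // Sketch of the pruning predicate from the diff: keep a file only when its
  // bucket id is in the pruned set, and fail loudly on non-bucket file names --
  // this is the error path the review comment is worried about.
  def filePruning(bucketSet: BitSet)(fileName: String): Boolean =
    bucketSet(getBucketId(fileName)
      .getOrElse(sys.error(s"Invalid bucket file $fileName")))

  def main(args: Array[String]): Unit = {
    val wanted = BitSet(0, 2)
    // Bucket 2 is wanted, bucket 1 is pruned out.
    assert(filePruning(wanted)("part-00000_00002.c000.snappy.parquet"))
    assert(!filePruning(wanted)("part-00000_00001.c000.snappy.parquet"))
    // A file written without a bucket suffix cannot be classified at all,
    // which is exactly the case where a clear error message matters.
    assert(getBucketId("no-bucket-suffix.parquet").isEmpty)
    println("ok")
  }
}
```

The sketch shows why a table whose files lack the bucket-id suffix (e.g. written by an external tool) would previously scan fine but now hits `sys.error`, which supports the request above for an explanatory message and a hint.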