maropu commented on a change in pull request #31413:
URL: https://github.com/apache/spark/pull/31413#discussion_r567798637



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala
##########
@@ -591,20 +590,34 @@ case class FileSourceScanExec(
     logInfo(s"Planning scan with bin packing, max size: $maxSplitBytes bytes, " +
       s"open cost is considered as scanning $openCostInBytes bytes.")
 
+    // Filter files with bucket pruning if possible
+    val filePruning: Path => Boolean = optionalBucketSet match {
+      case Some(bucketSet) =>
+        filePath => bucketSet.get(BucketingUtils.getBucketId(filePath.getName)
+          .getOrElse(sys.error(s"Invalid bucket file $filePath")))

Review comment:
       We already have options to control missing and corrupted files for data sources, so how about following those semantics?
   
https://github.com/apache/spark/blob/4e7e7ee6e5a46cdc9c402f860ef942fde4f831a5/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L1282-L1298
   
   For the data-corruption case above, how about throwing an exception by default, and stating in the exception message which option the user should set if they want to ignore such files?
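The semantics suggested above could look roughly like the following minimal, self-contained Scala sketch. It is not Spark's actual code: the file-name pattern mimics what `BucketingUtils.getBucketId` does, and the config name `spark.sql.files.ignoreInvalidBucketFiles` is an illustrative assumption, not a real Spark option.

```scala
object BucketPruningSketch {
  // Mimics BucketingUtils.getBucketId: the bucket id is the trailing
  // "_<digits>" group in a bucketed file name such as
  // "part-00000_00002.c000.snappy.parquet" (assumption for this sketch).
  private val bucketedFileName = """.*_(\d+)(?:\..*)?$""".r

  def getBucketId(fileName: String): Option[Int] = fileName match {
    case bucketedFileName(bucketId) => Some(bucketId.toInt)
    case _                          => None
  }

  // Throw by default, and point at a (hypothetical) option in the message,
  // following the ignoreMissingFiles/ignoreCorruptFiles style of SQLConf.
  def bucketIdOrThrow(fileName: String): Int =
    getBucketId(fileName).getOrElse {
      throw new IllegalStateException(
        s"Invalid bucket file $fileName. To skip invalid bucket files, set " +
          "spark.sql.files.ignoreInvalidBucketFiles=true (hypothetical flag).")
    }
}
```

With this shape, a well-formed name parses normally, while a malformed one fails loudly with a message that tells the user how to opt out, instead of silently dropping or misrouting the file.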




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.


