rdblue commented on a change in pull request #4307:
URL: https://github.com/apache/iceberg/pull/4307#discussion_r835978023



##########
File path: 
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/actions/BaseDeleteOrphanFilesSparkAction.java
##########
@@ -273,4 +285,35 @@ private static void listDirRecursively(
       return files.iterator();
     };
   }
+
+  @VisibleForTesting
+  static class PartitionAwareHiddenPathFilter implements PathFilter, 
Serializable {
+
+    private final Set<String> hiddenPathPartitionNames;
+
+    PartitionAwareHiddenPathFilter(Set<String> hiddenPathPartitionNames) {
+      this.hiddenPathPartitionNames = hiddenPathPartitionNames;
+    }
+
+    @Override
+    public boolean accept(Path path) {
+      boolean isHiddenPartitionPath = 
hiddenPathPartitionNames.stream().anyMatch(path.getName()::startsWith);
+      return isHiddenPartitionPath || HiddenPathFilter.get().accept(path);
+    }
+
+    static PathFilter build(Map<Integer, PartitionSpec> specs) {
+      if (specs == null) {
+        return HiddenPathFilter.get();
+      }
+
+      Set<String> partitionNames = specs.values().stream()
+          .map(PartitionSpec::fields)
+          .flatMap(List::stream)
+          .filter(partitionField -> partitionField.name().startsWith("_") || 
partitionField.name().startsWith("."))
+          .map(PartitionField::name)
+          .collect(Collectors.toSet());

Review comment:
       Looks like this will match prefixes and doesn't care about order. So a 
partition spec with names `_a` and `_b` would result in accepting paths like 
`_ahiddenfile` and `_b/_b` (out of order and/or duplicated).
   
   I'm not sure how much complexity is correct here, but I find this a bit too 
permissive. The problem is that these are the files that are going to be 
candidates for orphan cleanup, so allowing those cases could result in files 
getting deleted.
   
   I think a good compromise is to add `=` to each partition field name to use 
as a prefix: `_a` becomes `_a=`. That way, we at least check that the name is 
exactly `_a` and the folder name looks like a partition.
   
   Ideally, this would also be able to check for `_a` and `_b` a the correct 
level in the path, but I don't think that's possible without a lot of work. So 
let's leave it out.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to