karuppayya commented on code in PR #12278:
URL: https://github.com/apache/iceberg/pull/12278#discussion_r1958822862
##########
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/DeleteOrphanFilesSparkAction.java:
##########
@@ -335,6 +347,21 @@ private Dataset<String> listedFileDS() {
return spark().createDataset(completeMatchingFileRDD.rdd(),
Encoders.STRING());
}
+ private String dataLocation() {
Review Comment:
Currently `org.apache.iceberg.io.LocationProvider` has APIs to only operate
on data location. Also APIs needs an input in fileName.
Is the suggestion to enhance the LocationProviders to include more APIs for
data and metadata location?
##########
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/DeleteOrphanFilesSparkAction.java:
##########
@@ -301,24 +303,34 @@ private Dataset<FileURI> actualFileIdentDS() {
private Dataset<String> listedFileDS() {
List<String> subDirs = Lists.newArrayList();
- List<String> matchingFiles = Lists.newArrayList();
+ Set<String> matchingFiles = Sets.newHashSet();
Predicate<FileStatus> predicate = file -> file.getModificationTime() <
olderThanTimestamp;
PathFilter pathFilter =
PartitionAwareHiddenPathFilter.forSpecs(table.specs());
+ List<String> locationsToList = Lists.newArrayList();
+ if (location.equals(table.location())) {
+ locationsToList.add(dataLocation());
+ locationsToList.add(metadataFileLocation());
+ } else {
+ locationsToList.add(location);
+ }
Review Comment:
This is the existing behavior to check the given location instead of table
location, when specified via
`org.apache.iceberg.actions.DeleteOrphanFiles#location`
On a related note, the current code doesn't check
`org.apache.iceberg.TableProperties#WRITE_DATA_LOCATION`,
`org.apache.iceberg.TableProperties#WRITE_METADATA_LOCATION`
which this PR check.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]