[GitHub] [hadoop] bgaborg commented on a change in pull request #1810: HADOOP-16746 mkdirs and s3guard auth mode

2020-01-21 Thread GitBox
bgaborg commented on a change in pull request #1810: HADOOP-16746 mkdirs and 
s3guard auth mode
URL: https://github.com/apache/hadoop/pull/1810#discussion_r368971623
 
 

 ##
 File path: 
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/s3guard/S3Guard.java
 ##
 @@ -299,29 +313,72 @@ public static BulkOperationState initiateBulkWrite(
 // Since the authoritative case is already handled outside this function,
 // we will basically start with the set of directory entries in the
 // DirListingMetadata, and add any that only exist in the backingStatuses.
+//
+// We try to avoid writing any more child entries than need be to :-
+//  (a) save time and money.
+//  (b) avoid overwriting the authoritative bit of children (HADOOP-16746).
+// For auth mode updates, we supply the full listing and a list of which
+// child entries have not been changed; the store gets to optimize its
+// update however it chooses.
+//
+// for non-auth-mode S3Guard, we just build a list of entries to add and
+// submit them in a batch; this is more efficient than trickling out the
+// updates one-by-one.
+
+// track all unchanged entries; used so the metastore can identify entries
+// it doesn't need to update
+List unchangedEntries = new ArrayList<>(dirMeta.getListing().size());
+List nonAuthEntriesToAdd = new 
ArrayList<>(backingStatuses.size());
 boolean changed = false;
-final Map dirMetaMap = dirMeta.getListing().stream()
-.collect(Collectors.toMap(
-pm -> pm.getFileStatus().getPath(), PathMetadata::getFileStatus)
-);
+final Map dirMetaMap = dirMeta.getListing().stream()
+.collect(Collectors.toMap(pm -> pm.getFileStatus().getPath(), pm -> 
pm));
 BulkOperationState operationState = ms.initiateBulkWrite(
 BulkOperationState.OperationType.Listing,
 path);
 for (S3AFileStatus s : backingStatuses) {
-  if (deleted.contains(s.getPath())) {
+  final Path statusPath = s.getPath();
+  if (deleted.contains(statusPath)) {
 continue;
   }
 
-  final PathMetadata pathMetadata = new PathMetadata(s);
-
-  if (!isAuthoritative){
-FileStatus status = dirMetaMap.get(s.getPath());
-if (status != null
-&& s.getModificationTime() > status.getModificationTime()) {
-  LOG.debug("Update ms with newer metadata of: {}", status);
-  S3Guard.putWithTtl(ms, pathMetadata, timeProvider, operationState);
+  final PathMetadata originalMD = dirMetaMap.get(statusPath);
+
+  // this is built up to be whatever entry is added to the dirMeta
+  // collection
+  PathMetadata pathMetadata = originalMD;
+
+  if (!isAuthoritative) {
+// in non-auth listings, we compare the file status of the metastore
+// list with those in the FS, and overwrite the MS entry if
+// either of two conditions are met
+// - there is no entry in the metadata and
 
 Review comment:
   nit: there is no entry in the _metastore_ and


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[GitHub] [hadoop] bgaborg commented on a change in pull request #1810: HADOOP-16746 mkdirs and s3guard auth mode

2020-01-21 Thread GitBox
bgaborg commented on a change in pull request #1810: HADOOP-16746 mkdirs and 
s3guard auth mode
URL: https://github.com/apache/hadoop/pull/1810#discussion_r368986839
 
 

 ##
 File path: 
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/s3guard/S3Guard.java
 ##
 @@ -299,29 +313,72 @@ public static BulkOperationState initiateBulkWrite(
 // Since the authoritative case is already handled outside this function,
 // we will basically start with the set of directory entries in the
 // DirListingMetadata, and add any that only exist in the backingStatuses.
+//
+// We try to avoid writing any more child entries than need be to :-
+//  (a) save time and money.
+//  (b) avoid overwriting the authoritative bit of children (HADOOP-16746).
+// For auth mode updates, we supply the full listing and a list of which
+// child entries have not been changed; the store gets to optimize its
+// update however it chooses.
+//
+// for non-auth-mode S3Guard, we just build a list of entries to add and
+// submit them in a batch; this is more efficient than trickling out the
+// updates one-by-one.
+
+// track all unchanged entries; used so the metastore can identify entries
+// it doesn't need to update
+List unchangedEntries = new ArrayList<>(dirMeta.getListing().size());
+List nonAuthEntriesToAdd = new 
ArrayList<>(backingStatuses.size());
 boolean changed = false;
-final Map dirMetaMap = dirMeta.getListing().stream()
-.collect(Collectors.toMap(
-pm -> pm.getFileStatus().getPath(), PathMetadata::getFileStatus)
-);
+final Map dirMetaMap = dirMeta.getListing().stream()
+.collect(Collectors.toMap(pm -> pm.getFileStatus().getPath(), pm -> 
pm));
 BulkOperationState operationState = ms.initiateBulkWrite(
 BulkOperationState.OperationType.Listing,
 path);
 for (S3AFileStatus s : backingStatuses) {
-  if (deleted.contains(s.getPath())) {
+  final Path statusPath = s.getPath();
+  if (deleted.contains(statusPath)) {
 continue;
   }
 
-  final PathMetadata pathMetadata = new PathMetadata(s);
-
-  if (!isAuthoritative){
-FileStatus status = dirMetaMap.get(s.getPath());
-if (status != null
-&& s.getModificationTime() > status.getModificationTime()) {
-  LOG.debug("Update ms with newer metadata of: {}", status);
-  S3Guard.putWithTtl(ms, pathMetadata, timeProvider, operationState);
+  final PathMetadata originalMD = dirMetaMap.get(statusPath);
+
+  // this is built up to be whatever entry is added to the dirMeta
+  // collection
+  PathMetadata pathMetadata = originalMD;
+
+  if (!isAuthoritative) {
 
 Review comment:
   Could we factor out this to another method? Just to make Uncle Bob happy :).
   Jokes aside this grew really huge, and we should split the auth and non-auth 
execution paths to be more maintainable in the future. It was your idea in one 
of your comments on the PR, and that was a good idea. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[GitHub] [hadoop] bgaborg commented on a change in pull request #1810: HADOOP-16746 mkdirs and s3guard auth mode

2020-01-21 Thread GitBox
bgaborg commented on a change in pull request #1810: HADOOP-16746 mkdirs and 
s3guard auth mode
URL: https://github.com/apache/hadoop/pull/1810#discussion_r368973683
 
 

 ##
 File path: 
hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/s3guard/ITestDynamoDBMetadataStoreAuthoritativeMode.java
 ##
 @@ -55,7 +53,7 @@
 import static 
org.apache.hadoop.fs.s3a.S3ATestUtils.removeBaseAndBucketOverrides;
 import static org.apache.hadoop.fs.s3a.S3AUtils.applyLocatedFiles;
 import static org.apache.hadoop.fs.s3a.Statistic.OBJECT_LIST_REQUESTS;
-import static 
org.apache.hadoop.fs.s3a.Statistic.S3GUARD_METADATASTORE_AUTHORITATIVE_DIRECTORIES_UPDATED;
+import static 
org.apache.hadoop.fs.s3a.Statistic.S3GUARD_METADATASTORE_AUTHORITATIVE_DIRECTORIES_UPDATED;import
 static org.apache.hadoop.fs.s3a.Statistic.S3GUARD_METADATASTORE_RECORD_WRITES;
 
 Review comment:
   nit: import static in new line


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org