[ https://issues.apache.org/jira/browse/HADOOP-13449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15613426#comment-15613426 ]
Lei (Eddy) Xu commented on HADOOP-13449:
----------------------------------------

Good discussion, [~liuml07] and [~fabbri].

bq. The contract assumes we create the direct parent directory (other ancestors should be taken care of by the clients/callers) when putting a new file item. I checked the in-memory local metadata store and it implements this idea. This may be not efficient to DDB. Basically for putting X items, we have to issue 2X~3X DDB requests (X for putting file, X for checking its parent directories, and possible X for updating its parent directories). I'm wondering if we can also let the client/caller pre-create the direct parent directory as other ancestors.

I suggest considering this in two cases:
* Checking parent directories in normal {{S3AFileSystem}} operations (i.e., create / mkdirs). In this case, {{S3AFileSystem}} should already ensure the invariant of the contract (the parent directories exist before {{S3AFileSystem}} starts to create files on S3).
* Loading files and directories outside of normal {{S3AFileSystem}} operations, e.g., loading a *non-cached* directory or loading from a CLI tool. In such cases, would a small local "dentry_cache"-style data structure be sufficient for a batch operation? These operations can rely on the namespace structure already existing on S3.

As a last resort, if {{S3AFileSystem}} considers it safe to {{create / mkdir}} on a path, it can always create all of that path's parent directories in a single batch to DynamoDB.

In short, I'd suggest letting {{S3AFileSystem}} ensure the contract.

bq. We store the is_empty for directory in the DynamoDB (DDB) metadata store now. We have to update this information in a consistent and efficient way. We don't want to check the parent directory every time we delete/put a file item. At least we can optimize this when deleting a subtree.
Another way to do it is to set the {{isEmpty()}} flag by issuing a small _additional_ query on the directory with a [Limit=1|http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/QueryAndScan.html#ScanQueryLimit]. If the query returns a result, the {{isEmpty}} flag is false; otherwise, the flag is true. This value can be cached for the lifetime of the {{S3AFileStatus}}, as it cannot reliably reflect changes in S3 anyway. So the query cost only occurs when you call {{isEmpty()}} for the first time, and you don't need to update this flag on any S3 write.

Hope that works.

> S3Guard: Implement DynamoDBMetadataStore.
> -----------------------------------------
>
>                 Key: HADOOP-13449
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13449
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>            Reporter: Chris Nauroth
>            Assignee: Mingliang Liu
>         Attachments: HADOOP-13449-HADOOP-13345.000.patch, HADOOP-13449-HADOOP-13345.001.patch
>
>
> Provide an implementation of the metadata store backed by DynamoDB.
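
To illustrate the Limit=1 emptiness check suggested in the comment, here is a minimal sketch. It does not use the AWS SDK or any S3Guard class: {{InMemoryStore}}, {{DirStatus}}, and {{queryCount}} are hypothetical names, and the in-memory map is only an assumed stand-in for a table keyed by (parent path, child name). The point it shows is that a directory is non-empty as soon as a query limited to one item returns anything, and that the answer is computed lazily and cached on the status object.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

class EmptyDirCheck {

  /** Hypothetical in-memory stand-in for a table keyed by (parent, child). */
  static class InMemoryStore {
    private final Map<String, TreeSet<String>> childrenByParent = new TreeMap<>();
    int queryCount = 0; // counts how many queries were issued

    void put(String parent, String child) {
      childrenByParent.computeIfAbsent(parent, k -> new TreeSet<>()).add(child);
    }

    /** Returns at most {@code limit} children of {@code parent}, like a query with a Limit. */
    List<String> query(String parent, int limit) {
      queryCount++;
      List<String> out = new ArrayList<>();
      for (String child : childrenByParent.getOrDefault(parent, new TreeSet<>())) {
        if (out.size() >= limit) {
          break;
        }
        out.add(child);
      }
      return out;
    }
  }

  /** Hypothetical directory status that computes isEmpty() lazily and caches it. */
  static class DirStatus {
    private final InMemoryStore store;
    private final String path;
    private Boolean empty; // null until isEmpty() is first called

    DirStatus(InMemoryStore store, String path) {
      this.store = store;
      this.path = path;
    }

    boolean isEmpty() {
      if (empty == null) {
        // One item is enough to prove the directory is not empty.
        empty = store.query(path, 1).isEmpty();
      }
      return empty;
    }
  }

  public static void main(String[] args) {
    InMemoryStore store = new InMemoryStore();
    store.put("/a", "file1");
    store.put("/a", "file2");

    DirStatus a = new DirStatus(store, "/a");
    DirStatus b = new DirStatus(store, "/b");

    System.out.println(a.isEmpty());      // false
    System.out.println(b.isEmpty());      // true
    a.isEmpty();                          // served from the cached value
    System.out.println(store.queryCount); // 2
  }
}
```

Because the flag is cached per status object, later writes under the same directory are not reflected in an already-created {{DirStatus}}, which matches the trade-off described in the comment.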