[ https://issues.apache.org/jira/browse/HDFS-14440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16822031#comment-16822031 ]
Íñigo Goiri commented on HDFS-14440:
------------------------------------

I think we should try to exploit hashing instead of invoking everywhere.
There are two cases here:
* Old approach:
** The file already exists: we check the subcluster where we expect it to be. This is 1 RPC call.
** The file is new: we go over all the subclusters one by one. This is N RPC calls made sequentially (N round trips).
* The approach in [^HDFS-14440-HDFS-13891-01.patch]:
** The file already exists: we check everywhere. This is N RPC calls made concurrently (about 1 round trip).
** The file is new: we check everywhere. This is N RPC calls made concurrently (about 1 round trip).

I personally prefer the old approach.
The new one reduces the time for an operation that is not that critical.
There could even be a hybrid where we check the subcluster we expect first and then check the rest concurrently.
Are you seeing this as a bottleneck?

> RBF: Optimize the file write process in case of multiple destinations.
> ----------------------------------------------------------------------
>
>                 Key: HDFS-14440
>                 URL: https://issues.apache.org/jira/browse/HDFS-14440
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>            Reporter: Ayush Saxena
>            Assignee: Ayush Saxena
>            Priority: Major
>         Attachments: HDFS-14440-HDFS-13891-01.patch
>
>
> In case of multiple destinations, we need to check if the file already exists
> in one of the subclusters, for which we use the existing getBlockLocation()
> API, which is by default a sequential call.
> In the ideal scenario where the file needs to be created, each subcluster is
> checked sequentially; this can be done concurrently to save time.
> In another case, where the file is found and the last block is null, we
> need to call getFileInfo on all the locations to find the location where the
> file exists. This can also be avoided by using ConcurrentCall, since we
> would then have the remoteLocation for which getBlockLocation returned a
> non-null entry.
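
The hybrid mentioned in the comment (check the expected subcluster first, then the rest concurrently) could look roughly like the sketch below. It is only an illustration under stated assumptions: `HybridLookup`, `locate`, and the `exists` predicate are hypothetical names, the predicate stands in for a single RPC to one subcluster, and none of this is the Router's actual code or the attached patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.function.Predicate;

/** Hypothetical sketch of the hybrid lookup; not the RBF Router code. */
class HybridLookup {

  /**
   * Returns a subcluster that reports the file, or empty if none does.
   * "exists" is an assumed stand-in for one RPC to one subcluster.
   */
  static Optional<String> locate(String expected, List<String> all,
      Predicate<String> exists, ExecutorService pool) {
    // Existing file in the hashed location: 1 RPC, same cost as the old approach.
    if (exists.test(expected)) {
      return Optional.of(expected);
    }
    // Otherwise fan out to the remaining subclusters concurrently:
    // N-1 further RPCs, issued in roughly one round trip.
    List<String> rest = new ArrayList<>();
    List<CompletableFuture<Boolean>> futures = new ArrayList<>();
    for (String ns : all) {
      if (!ns.equals(expected)) {
        rest.add(ns);
        futures.add(CompletableFuture.supplyAsync(() -> exists.test(ns), pool));
      }
    }
    // Wait for the answers and return the first match in list order.
    for (int i = 0; i < rest.size(); i++) {
      if (futures.get(i).join()) {
        return Optional.of(rest.get(i));
      }
    }
    return Optional.empty();
  }
}
```

With this shape, the common "file already exists where hashing puts it" case still costs 1 RPC, and only the miss case pays for the concurrent fan-out.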