Subramanyam Pattipaka created HADOOP-13403:
----------------------------------------------

             Summary: AzureNativeFileSystem rename/delete performance 
improvements
                 Key: HADOOP-13403
                 URL: https://issues.apache.org/jira/browse/HADOOP-13403
             Project: Hadoop Common
          Issue Type: Bug
          Components: azure
            Reporter: Subramanyam Pattipaka


WASB Performance Improvements

Problem
-----------
Azure Native File System operations such as rename and delete perform poorly 
when the source directory contains a large number of directories and/or files. 
Here are possible reasons:
a)      We first list all files under the source directory hierarchically; this 
is a serial operation.
b)      After collecting the entire list of files under a folder, we delete or 
rename the files one by one, serially.
c)      No logging information is available for these costly operations, even 
in DEBUG mode, making it difficult to diagnose WASB performance issues.

Proposal
-------------
Step 1: Rename and delete operations will generate a list of all files under 
the source folder. We use the Azure flat listing option to get the list with a 
single request to the Azure store. We have introduced the config 
fs.azure.flatlist.enable to enable this option. The default value is 'false', 
which means flat listing is disabled.
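
        For reference, a sketch of how the new option might be enabled in 
core-site.xml (the property name is taken from this proposal; the value shown 
is the non-default setting):

        ```xml
        <!-- Opt in to Azure flat listing for rename/delete (default: false) -->
        <property>
          <name>fs.azure.flatlist.enable</name>
          <value>true</value>
        </property>
        ```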

Step 2: Create the thread pool and threads dynamically based on user 
configuration. These thread pools will be deleted after the operation is over. 
We are introducing two new configs:
        a)      fs.azure.rename.threads: Config to set the number of rename 
threads. The default value is 0, which means no threading.
        b)      fs.azure.delete.threads: Config to set the number of delete 
threads. The default value is 0, which means no threading.

        We have provided debug log information on the number of threads not 
used for the operation, which can be useful for tuning.

        Failure Scenarios:
        If we fail to create the thread pool for ANY reason (for example, 
trying to create it with a very large thread count such as 1000000), we fall 
back to the serial operation.
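
        A minimal sketch of this create-or-fall-back behavior, using plain 
java.util.concurrent (class and method names here are illustrative, not the 
actual WASB implementation):

        ```java
        import java.util.concurrent.ExecutorService;
        import java.util.concurrent.Executors;

        public class PoolSketch {
            /**
             * Try to create a fixed-size thread pool of the configured size.
             * Returns null when threading is disabled (count <= 0) or when
             * pool creation fails for any reason, signalling to the caller
             * that it should take the serial path instead.
             */
            public static ExecutorService createPool(int configuredThreads) {
                if (configuredThreads <= 0) {
                    return null; // threading disabled by config
                }
                try {
                    return Executors.newFixedThreadPool(configuredThreads);
                } catch (Throwable t) {
                    // e.g. resource exhaustion for an absurd thread count;
                    // never fail the operation, just run serially
                    return null;
                }
            }
        }
        ```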

Step 3: Blob operations can then be done in parallel, with multiple threads 
executing the following snippet:
        while ((currentIndex = fileIndex.getAndIncrement()) < files.length) {
                FileMetadata file = files[currentIndex];
                renameOrDelete(file); // rename or delete, depending on the operation
        }

        The above strategy depends on the fact that all files are stored in a 
final array, and each thread atomically claims the next index to work on. The 
advantage of this strategy is that even if the user configures a large number 
of unusable threads, we always ensure that work doesn't get serialized behind 
lagging threads.
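
        The index-handoff scheme above can be sketched in isolation as follows 
(the file names and the processing step are placeholders for the real 
rename/delete calls):

        ```java
        import java.util.concurrent.ConcurrentLinkedQueue;
        import java.util.concurrent.atomic.AtomicInteger;

        public class ParallelSketch {
            /** Process every element of files exactly once across nThreads threads. */
            public static ConcurrentLinkedQueue<String> processAll(String[] files, int nThreads) {
                AtomicInteger fileIndex = new AtomicInteger(0);
                ConcurrentLinkedQueue<String> processed = new ConcurrentLinkedQueue<>();
                Thread[] workers = new Thread[nThreads];
                for (int i = 0; i < nThreads; i++) {
                    workers[i] = new Thread(() -> {
                        int currentIndex;
                        // Each thread atomically claims the next index, so no
                        // file is handled twice and a slow thread never blocks
                        // the others from making progress.
                        while ((currentIndex = fileIndex.getAndIncrement()) < files.length) {
                            processed.add(files[currentIndex]); // stand-in for rename/delete
                        }
                    });
                    workers[i].start();
                }
                for (Thread w : workers) {
                    try {
                        w.join();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
                return processed;
            }
        }
        ```

        Note that a thread that claims no index (because the file count is 
smaller than the thread count) simply exits; those are the "unusable threads" 
mentioned above.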

        We log the following information, which can be useful for tuning the 
number of threads:

        a) Number of unusable threads
        b) Time taken by each thread
        c) Number of files processed by each thread
        d) Total time taken for the operation

        Failure Scenarios:

        Failure to queue a thread execution request shouldn't be an issue as 
long as at least one thread completes execution successfully. If we couldn't 
schedule even one thread, we take the serial path. Exceptions raised while 
executing threads are still treated as regular exceptions and returned to the 
client as an operation failure. Exceptions raised while stopping threads and 
deleting the thread pool can be ignored if the operation completed on all 
files without any issue.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
