Faster directory traversal for raid node
----------------------------------------

                 Key: MAPREDUCE-2167
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2167
             Project: Hadoop Map/Reduce
          Issue Type: Improvement
          Components: contrib/raid
            Reporter: Ramkumar Vadali
            Assignee: Ramkumar Vadali


The RaidNode currently iterates over the directory structure to figure out 
which files to RAID. With millions of files, this can take a long time - 
especially if some files are already RAIDed and the RaidNode needs to look at 
parity files / parity file HARs to determine if the file needs to be RAIDed.

The directory traversal is encapsulated inside the class DirectoryTraversal, 
which examines one file at a time, using the caller's thread.

My proposal is to make this multi-threaded as follows:
 * Use a pool of threads inside DirectoryTraversal.
 * The caller's thread is used to retrieve directories, and each new directory 
is assigned to a thread in the pool. The worker thread examines all the files 
in the directory.
 * If there are sub-directories, those are added back as work items to the pool.
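
A minimal sketch of the scheme above, for discussion only. It is not the 
actual DirectoryTraversal code: it walks a local directory tree with 
java.io.File instead of the Hadoop FileSystem API, and the class name 
ParallelTraversalSketch and the needsRaid predicate are hypothetical 
stand-ins for the real parity-file check. As a simplification, the caller's 
thread only submits the root directory and then waits; workers submit the 
sub-directories they find.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Predicate;

// Hypothetical sketch; each instance is meant for a single traverse() call.
class ParallelTraversalSketch {
    private final ExecutorService pool;
    // Assumed stand-in for "does this file still need to be RAIDed?"
    private final Predicate<File> needsRaid;
    private final Queue<File> selected = new ConcurrentLinkedQueue<>();
    // Number of directories submitted but not yet fully examined.
    private final AtomicInteger pending = new AtomicInteger();
    private final CountDownLatch done = new CountDownLatch(1);

    ParallelTraversalSketch(int numThreads, Predicate<File> needsRaid) {
        this.pool = Executors.newFixedThreadPool(numThreads);
        this.needsRaid = needsRaid;
    }

    // Blocks the caller until every directory under root has been examined.
    List<File> traverse(File root) throws InterruptedException {
        try {
            submit(root);
            done.await();
        } finally {
            pool.shutdown();
        }
        return new ArrayList<>(selected);
    }

    // Each directory is one work item handled by a pool thread.
    private void submit(File dir) {
        pending.incrementAndGet();
        pool.execute(() -> {
            try {
                File[] entries = dir.listFiles();
                if (entries == null) {
                    return; // not a directory, or unreadable
                }
                for (File entry : entries) {
                    if (entry.isDirectory()) {
                        submit(entry); // sub-directory goes back to the pool
                    } else if (needsRaid.test(entry)) {
                        selected.add(entry);
                    }
                }
            } finally {
                // Children are counted before this decrement, so the count
                // reaches zero only when the whole tree has been processed.
                if (pending.decrementAndGet() == 0) {
                    done.countDown();
                }
            }
        });
    }
}
```

A caller would construct it with a thread count and a predicate, then call 
traverse(root) once. Workers never block on each other, so a fixed-size pool 
with an unbounded queue cannot deadlock here; the trade-off is that the queue 
of pending directories can grow large on very wide trees.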

Comments?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
