[jira] Updated: (MAPREDUCE-2167) Faster directory traversal for raid node

2010-11-11 Thread Scott Chen (JIRA)

 [ https://issues.apache.org/jira/browse/MAPREDUCE-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Scott Chen updated MAPREDUCE-2167:
--

   Resolution: Fixed
Fix Version/s: 0.22.0
 Hadoop Flags: [Reviewed]
   Status: Resolved  (was: Patch Available)

I just committed this. Thanks Ram.

 Faster directory traversal for raid node
 

 Key: MAPREDUCE-2167
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2167
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: contrib/raid
Reporter: Ramkumar Vadali
Assignee: Ramkumar Vadali
 Fix For: 0.22.0

 Attachments: MAPREDUCE-2167.2.patch, MAPREDUCE-2167.3.patch, 
 MAPREDUCE-2167.4.patch, MAPREDUCE-2167.patch


 The RaidNode currently iterates over the directory structure to figure out 
 which files to RAID. With millions of files, this can take a long time - 
 especially if some files are already RAIDed and the RaidNode needs to look at 
 parity files / parity file HARs to determine if the file needs to be RAIDed.
 The directory traversal is encapsulated inside the class DirectoryTraversal, 
 which examines one file at a time, using the caller's thread.
 My proposal is to make this multi-threaded as follows:
  * Use a pool of threads inside DirectoryTraversal.
  * The caller's thread is used to retrieve directories, and each new 
 directory is assigned to a thread in the pool. The worker thread examines all 
 the files in the directory.
  * If there are sub-directories, those are added back as work items to the pool.
 Comments?
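
 A minimal sketch of the proposal above, not the code from the attached 
 patches. It assumes a local filesystem walked with java.nio.file instead of 
 the Hadoop FileSystem API, and the class name PooledTraversalSketch, the 
 method traverse() and the needsRaid predicate (a stand-in for the parity 
 file check) are all hypothetical. Each directory is one work item, and 
 sub-directories are handed back to the same pool:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;
import java.util.Collection;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Predicate;

// Hypothetical sketch: this is not the DirectoryTraversal class from the patch.
class PooledTraversalSketch {
  private final ExecutorService pool;
  private final Predicate<Path> needsRaid; // stand-in for the parity-file check
  private final Collection<Path> selected = new ConcurrentLinkedQueue<>();
  private final AtomicInteger pendingDirs = new AtomicInteger();
  private final CountDownLatch done = new CountDownLatch(1);

  private PooledTraversalSketch(int numThreads, Predicate<Path> needsRaid) {
    this.pool = Executors.newFixedThreadPool(numThreads);
    this.needsRaid = needsRaid;
  }

  /** Walks root with numThreads workers and returns the files accepted by needsRaid. */
  static Collection<Path> traverse(Path root, int numThreads, Predicate<Path> needsRaid)
      throws InterruptedException {
    PooledTraversalSketch t = new PooledTraversalSketch(numThreads, needsRaid);
    try {
      t.submit(root);
      t.done.await(); // released once the last outstanding directory finishes
    } finally {
      t.pool.shutdown();
    }
    return t.selected;
  }

  /** Queues one directory as a work item; counted before it is handed to the pool. */
  private void submit(Path dir) {
    pendingDirs.incrementAndGet();
    pool.execute(() -> examine(dir));
  }

  /** Runs on a worker thread: checks each file, re-queues each sub-directory. */
  private void examine(Path dir) {
    try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
      for (Path entry : entries) {
        if (Files.isDirectory(entry, LinkOption.NOFOLLOW_LINKS)) {
          submit(entry); // sub-directory goes back to the pool
        } else if (needsRaid.test(entry)) {
          selected.add(entry);
        }
      }
    } catch (IOException e) {
      // Sketch only: an unreadable directory is skipped silently.
    } finally {
      // Children were counted before this decrement, so zero means all done.
      if (pendingDirs.decrementAndGet() == 0) {
        done.countDown();
      }
    }
  }
}
```

 A call such as PooledTraversalSketch.traverse(root, 4, p -> true) returns 
 every regular file under root. The counter is incremented before a 
 directory is queued and decremented only after its children have been 
 queued, so it cannot reach zero while work is still outstanding.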

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-2167) Faster directory traversal for raid node

2010-11-09 Thread Ramkumar Vadali (JIRA)

 [ https://issues.apache.org/jira/browse/MAPREDUCE-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ramkumar Vadali updated MAPREDUCE-2167:
---

Attachment: MAPREDUCE-2167.4.patch

Fixed a broken test.

TEST RESULTS:


ant test-patch has the same number of failures as a clean checkout

{code}
 [exec] -1 overall.
 [exec]
 [exec] +1 @author.  The patch does not contain any @author tags.
 [exec]
 [exec] +1 tests included.  The patch appears to include 4 new or modified tests.
 [exec]
 [exec] +1 javadoc.  The javadoc tool did not generate any warning messages.
 [exec]
 [exec] +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
 [exec]
 [exec] -1 findbugs.  The patch appears to introduce 13 new Findbugs warnings.
 [exec]
 [exec] -1 release audit.  The applied patch generated 2 release audit warnings (more than the trunk's current 1 warnings).
 [exec]
 [exec] +1 system test framework.  The patch passed system test framework compile.
 [exec]
 [exec] ======================================================================
 [exec] ======================================================================
 [exec] Finished build.
 [exec] ======================================================================
 [exec] ======================================================================
 [exec]
{code}

ant test succeeds:

{code}


test-junit:
[junit] WARNING: multiple versions of ant detected in path for junit
[junit]  jar:file:/home/rvadali/local/external/ant/lib/ant.jar!/org/apache/tools/ant/Project.class
[junit]  and jar:file:/home/rvadali/.ivy2/cache/ant/ant/jars/ant-1.6.5.jar!/org/apache/tools/ant/Project.class
[junit] Running org.apache.hadoop.hdfs.TestRaidDfs
[junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 47.071 sec
[junit] Running org.apache.hadoop.raid.TestBlockFixer
[junit] Tests run: 5, Failures: 0, Errors: 0, Time elapsed: 124.583 sec
[junit] Running org.apache.hadoop.raid.TestDirectoryTraversal
[junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 9.337 sec
[junit] Running org.apache.hadoop.raid.TestErasureCodes
[junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 24.481 sec
[junit] Running org.apache.hadoop.raid.TestGaloisField
[junit] Tests run: 7, Failures: 0, Errors: 0, Time elapsed: 0.392 sec
[junit] Running org.apache.hadoop.raid.TestHarIndexParser
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.052 sec
[junit] Running org.apache.hadoop.raid.TestRaidFilter
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 4.485 sec
[junit] Running org.apache.hadoop.raid.TestRaidHar
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 71.136 sec
[junit] Running org.apache.hadoop.raid.TestRaidNode
[junit] Tests run: 4, Failures: 0, Errors: 0, Time elapsed: 471.072 sec
[junit] Running org.apache.hadoop.raid.TestRaidPurge
[junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 107.828 sec
[junit] Running org.apache.hadoop.raid.TestRaidShell
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 25.714 sec

test:

BUILD SUCCESSFUL
Total time: 15 minutes 6 seconds
{code}





[jira] Updated: (MAPREDUCE-2167) Faster directory traversal for raid node

2010-11-08 Thread Ramkumar Vadali (JIRA)

 [ https://issues.apache.org/jira/browse/MAPREDUCE-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ramkumar Vadali updated MAPREDUCE-2167:
---

Attachment: MAPREDUCE-2167.3.patch

Added a comment explaining the use of the slots semaphore.




[jira] Updated: (MAPREDUCE-2167) Faster directory traversal for raid node

2010-11-05 Thread Ramkumar Vadali (JIRA)

 [ https://issues.apache.org/jira/browse/MAPREDUCE-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ramkumar Vadali updated MAPREDUCE-2167:
---

Attachment: MAPREDUCE-2167.2.patch

Using a semaphore now to track the active threads. The logic is much simpler 
now.
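
One common way to track active threads with a semaphore is sketched below. 
The patch itself is not shown in this thread, so this is an assumption about 
the general idiom rather than the committed code; the class 
SlotSemaphoreSketch and its methods submit() and awaitIdle() are 
hypothetical names. Each worker holds one permit ("slot") while it runs, 
and the caller knows the pool is idle once it can take every permit at once:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;

// Hypothetical sketch of a "slots" semaphore; not taken from the patch.
class SlotSemaphoreSketch {
  private final int numSlots;
  private final Semaphore slots;
  private final ExecutorService pool;

  SlotSemaphoreSketch(int numSlots) {
    this.numSlots = numSlots;
    this.slots = new Semaphore(numSlots);
    this.pool = Executors.newFixedThreadPool(numSlots);
  }

  /** Blocks the caller while every slot is busy, then runs work on the pool. */
  void submit(Runnable work) throws InterruptedException {
    slots.acquire();
    try {
      pool.execute(() -> {
        try {
          work.run();
        } finally {
          slots.release(); // the slot is free again when the worker finishes
        }
      });
    } catch (RuntimeException e) {
      slots.release(); // the task never started, so give the slot back
      throw e;
    }
  }

  /** Returns once no worker holds a slot. Intended for a single submitting thread. */
  void awaitIdle() throws InterruptedException {
    slots.acquire(numSlots);
    slots.release(numSlots);
  }

  void shutdown() {
    pool.shutdown();
  }
}
```

With this shape there is no separate counter of running tasks to keep in 
sync: the number of permits taken is the number of active workers.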




[jira] Updated: (MAPREDUCE-2167) Faster directory traversal for raid node

2010-11-02 Thread Ramkumar Vadali (JIRA)

 [ https://issues.apache.org/jira/browse/MAPREDUCE-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ramkumar Vadali updated MAPREDUCE-2167:
---

Attachment: MAPREDUCE-2167.patch

This patch implements the following fix:
 - the signature of getFilteredFiles() does not change
 - the caller's thread is used to get the next directories
 - a thread pool is used to process the files in the directory
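
The three points above can be illustrated with a small sketch. It is not the 
patch: it assumes a local filesystem accessed through java.nio.file rather 
than the Hadoop FileSystem API, and the class CallerDrivenTraversalSketch, 
the parameter list of getFilteredFiles() shown here and the filter predicate 
are hypothetical. The caller's thread lists directories and the pool applies 
the (potentially slow) per-file check, while the caller still sees one 
blocking call that returns a list:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Predicate;

// Hypothetical sketch; the real getFilteredFiles() signature is not shown in this thread.
class CallerDrivenTraversalSketch {

  static List<Path> getFilteredFiles(Path root, int numThreads, Predicate<Path> filter)
      throws IOException, InterruptedException, ExecutionException {
    ExecutorService pool = Executors.newFixedThreadPool(numThreads);
    try {
      Deque<Path> dirs = new ArrayDeque<>();
      dirs.push(root);
      List<Future<List<Path>>> batches = new ArrayList<>();

      // Caller's thread: fetch the next directory and split its entries.
      while (!dirs.isEmpty()) {
        Path dir = dirs.pop();
        List<Path> files = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
          for (Path entry : entries) {
            if (Files.isDirectory(entry, LinkOption.NOFOLLOW_LINKS)) {
              dirs.push(entry);
            } else {
              files.add(entry);
            }
          }
        }
        // Pool thread: run the filter over this directory's files.
        Callable<List<Path>> task = () -> {
          List<Path> kept = new ArrayList<>();
          for (Path f : files) {
            if (filter.test(f)) {
              kept.add(f);
            }
          }
          return kept;
        };
        batches.add(pool.submit(task));
      }

      // Gather results in submission order before returning to the caller.
      List<Path> result = new ArrayList<>();
      for (Future<List<Path>> batch : batches) {
        result.addAll(batch.get());
      }
      return result;
    } finally {
      pool.shutdown();
    }
  }
}
```

Because the pool is created and drained inside the method, existing callers 
would not need to know that threads are involved.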




[jira] Updated: (MAPREDUCE-2167) Faster directory traversal for raid node

2010-11-02 Thread Ramkumar Vadali (JIRA)

 [ https://issues.apache.org/jira/browse/MAPREDUCE-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ramkumar Vadali updated MAPREDUCE-2167:
---

Status: Patch Available  (was: Open)
