[jira] Commented: (HDFS-854) Datanode should scan devices in parallel to generate block report

2010-03-12 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844422#action_12844422
 ] 

Hadoop QA commented on HDFS-854:


-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12438481/HDFS-854.patch
  against trunk revision 921697.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed core unit tests.

-1 contrib tests.  The patch failed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/268/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/268/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: 
http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/268/artifact/trunk/build/test/checkstyle-errors.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/268/console

This message is automatically generated.

> Datanode should scan devices in parallel to generate block report
> -
>
> Key: HDFS-854
> URL: https://issues.apache.org/jira/browse/HDFS-854
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: data-node
>Affects Versions: 0.22.0
>Reporter: dhruba borthakur
>Assignee: Dmytro Molkov
> Fix For: 0.22.0
>
> Attachments: HDFS-854.patch
>
>
> A Datanode should scan its disk devices in parallel so that the time to 
> generate a block report is reduced. This will reduce the startup time of a 
> cluster.
> A datanode has 12 disks (1 TB each) to store HDFS blocks. There is a total 
> of 150K blocks on these 12 disks. It takes the datanode up to 20 minutes to 
> scan these devices to generate the first block report.
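A hedged sketch of the parallel-scan idea (the scanVolume walk below is a 
hypothetical stand-in for the datanode's real per-disk scan, not the patch's 
actual classes):

    import java.io.File;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class ParallelScan {
        // One scan task per disk: with 12 independent spindles, walking
        // them concurrently should cut first-block-report time roughly by
        // the disk count compared to a serial scan.
        static List<String> scanAllVolumes(List<File> volumes) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(volumes.size());
            List<Future<List<String>>> futures =
                new ArrayList<Future<List<String>>>();
            for (final File vol : volumes) {
                futures.add(pool.submit(new Callable<List<String>>() {
                    public List<String> call() {
                        return scanVolume(vol);
                    }
                }));
            }
            List<String> blocks = new ArrayList<String>();
            for (Future<List<String>> f : futures) {
                blocks.addAll(f.get());   // merge per-disk results
            }
            pool.shutdown();
            return blocks;
        }

        // Hypothetical per-disk walk; the real datanode recurses through
        // the volume's block directories.
        static List<String> scanVolume(File vol) {
            List<String> names = new ArrayList<String>();
            File[] files = vol.listFiles();
            if (files != null) {
                for (File f : files) {
                    names.add(f.getName());
                }
            }
            return names;
        }
    }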

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-918) Use single Selector and small thread pool to replace many instances of BlockSender for reads

2010-03-12 Thread Zlatin Balevsky (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844485#action_12844485
 ] 

Zlatin Balevsky commented on HDFS-918:
--

bq. I think it is very important to have separate pools for each partition

+1

> Use single Selector and small thread pool to replace many instances of 
> BlockSender for reads
> 
>
> Key: HDFS-918
> URL: https://issues.apache.org/jira/browse/HDFS-918
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: data-node
>Reporter: Jay Booth
> Fix For: 0.22.0
>
> Attachments: hdfs-918-20100201.patch, hdfs-918-20100203.patch, 
> hdfs-918-20100211.patch, hdfs-918-20100228.patch, hdfs-918-20100309.patch, 
> hdfs-multiplex.patch
>
>
> Currently, on read requests, the DataXCeiver server allocates a new thread 
> per request, which must allocate its own buffers and leads to 
> higher-than-optimal CPU and memory usage by the sending threads.  If we had a 
> single selector and a small threadpool to multiplex request packets, we could 
> theoretically achieve higher performance while taking up fewer resources and 
> leaving more CPU on datanodes available for mapred, hbase or whatever.  This 
> can be done without changing any wire protocols.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-854) Datanode should scan devices in parallel to generate block report

2010-03-12 Thread Dmytro Molkov (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmytro Molkov updated HDFS-854:
---

Status: Open  (was: Patch Available)

> Datanode should scan devices in parallel to generate block report
> -
>
> Key: HDFS-854
> URL: https://issues.apache.org/jira/browse/HDFS-854
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: data-node
>Affects Versions: 0.22.0
>Reporter: dhruba borthakur
>Assignee: Dmytro Molkov
> Fix For: 0.22.0
>
> Attachments: HDFS-854.patch
>
>
> A Datanode should scan its disk devices in parallel so that the time to 
> generate a block report is reduced. This will reduce the startup time of a 
> cluster.
> A datanode has 12 disks (1 TB each) to store HDFS blocks. There is a total 
> of 150K blocks on these 12 disks. It takes the datanode up to 20 minutes to 
> scan these devices to generate the first block report.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-854) Datanode should scan devices in parallel to generate block report

2010-03-12 Thread Dmytro Molkov (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmytro Molkov updated HDFS-854:
---

Status: Patch Available  (was: Open)

Resubmitting the same patch, because I could not connect the test failures to 
this change.

> Datanode should scan devices in parallel to generate block report
> -
>
> Key: HDFS-854
> URL: https://issues.apache.org/jira/browse/HDFS-854
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: data-node
>Affects Versions: 0.22.0
>Reporter: dhruba borthakur
>Assignee: Dmytro Molkov
> Fix For: 0.22.0
>
> Attachments: HDFS-854.patch
>
>
> A Datanode should scan its disk devices in parallel so that the time to 
> generate a block report is reduced. This will reduce the startup time of a 
> cluster.
> A datanode has 12 disks (1 TB each) to store HDFS blocks. There is a total 
> of 150K blocks on these 12 disks. It takes the datanode up to 20 minutes to 
> scan these devices to generate the first block report.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-918) Use single Selector and small thread pool to replace many instances of BlockSender for reads

2010-03-12 Thread Jay Booth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844596#action_12844596
 ] 

Jay Booth commented on HDFS-918:


bq. I think it is very important to have separate pools for each partition. 
Otherwise, each disk will be accessed only as much as the slowest disk (when 
the DN has enough load).

This would be the case if I were using a fixed-size thread pool and a 
LinkedBlockingQueue -- but I'm not. See Executors.newCachedThreadPool(): it's 
actually bounded at Integer.MAX_VALUE threads and uses a SynchronousQueue. If 
a new thread is needed in order to start work on a task immediately, it's 
created. Otherwise, an existing waiting thread is re-used. (Threads are purged 
if they've been idle for 60 seconds.) Either way, the underlying I/O request is 
dispatched almost immediately once the connection is writable. So I don't see 
how separate pools per partition would help: the operating system will handle 
IO requests as it can and put threads into the runnable state as it can, 
regardless of which pool they're in.
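For illustration, the documented equivalent of Executors.newCachedThreadPool() 
(this is standard JDK behavior, not code from the patch):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.SynchronousQueue;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    public class CachedPoolDemo {
        public static void main(String[] args) {
            // Same construction the JDK uses for newCachedThreadPool():
            // core size 0, unbounded max, 60s idle timeout, direct handoff.
            ExecutorService pool = new ThreadPoolExecutor(
                0, Integer.MAX_VALUE,
                60L, TimeUnit.SECONDS,
                new SynchronousQueue<Runnable>());

            // A SynchronousQueue has no capacity: execute() either hands the
            // task to an idle thread or spawns a new one immediately, so no
            // task ever queues up behind a slow disk.
            pool.execute(new Runnable() {
                public void run() {
                    System.out.println("dispatched immediately");
                }
            });
            pool.shutdown();
        }
    }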

RE: Netty, I'm not very knowledgeable about it beyond the Cliff's Notes 
version, but my code dealing with the Selector is pretty small -- the main loop 
is under 75 lines, and java.util.concurrent does most of the heavy lifting. 
Most of the code deals with application and protocol specifics. So my instinct 
in general is that adding a framework may actually increase the amount of code, 
especially if there are any mismatches between what we're doing and what it 
wants us to do (the packet-header, checksums, and main-data format is pretty 
specific to us). Plus, as Todd said, we can't really change the blocking IO 
nature of the main accept() loop in DataXceiverServer without this becoming a 
much bigger patch, although I agree that we should go there in general. That 
being said, better is better, so if a Netty implementation took up fewer lines 
of code and performed better, then that speaks for itself.
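For context, a generic sketch of the kind of single-selector dispatch loop 
described above (illustrative only, not the patch's actual main loop):

    import java.nio.channels.SelectionKey;
    import java.nio.channels.Selector;
    import java.util.Iterator;
    import java.util.concurrent.ExecutorService;

    public class SelectorLoop {
        // Shape of a single-selector loop: wait until a connection is
        // writable, then hand the disk read + send off to the worker pool
        // so the selector thread itself never blocks on I/O.
        void selectLoop(Selector selector, ExecutorService workers)
                throws Exception {
            while (selector.isOpen()) {
                selector.select();
                Iterator<SelectionKey> it = selector.selectedKeys().iterator();
                while (it.hasNext()) {
                    SelectionKey key = it.next();
                    it.remove();
                    if (key.isValid() && key.isWritable()) {
                        // Drop write interest until the worker finishes so
                        // the key isn't selected again mid-flight.
                        key.interestOps(
                            key.interestOps() & ~SelectionKey.OP_WRITE);
                        workers.execute((Runnable) key.attachment());
                    }
                }
            }
        }
    }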

> Use single Selector and small thread pool to replace many instances of 
> BlockSender for reads
> 
>
> Key: HDFS-918
> URL: https://issues.apache.org/jira/browse/HDFS-918
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: data-node
>Reporter: Jay Booth
> Fix For: 0.22.0
>
> Attachments: hdfs-918-20100201.patch, hdfs-918-20100203.patch, 
> hdfs-918-20100211.patch, hdfs-918-20100228.patch, hdfs-918-20100309.patch, 
> hdfs-multiplex.patch
>
>
> Currently, on read requests, the DataXCeiver server allocates a new thread 
> per request, which must allocate its own buffers and leads to 
> higher-than-optimal CPU and memory usage by the sending threads.  If we had a 
> single selector and a small threadpool to multiplex request packets, we could 
> theoretically achieve higher performance while taking up fewer resources and 
> leaving more CPU on datanodes available for mapred, hbase or whatever.  This 
> can be done without changing any wire protocols.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-1024) SecondaryNamenode fails to checkpoint because namenode fails with CancelledKeyException

2010-03-12 Thread dhruba borthakur (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844607#action_12844607
 ] 

dhruba borthakur commented on HDFS-1024:


+1 code looks good.

> SecondaryNamenode fails to checkpoint because namenode fails with 
> CancelledKeyException
> ---
>
> Key: HDFS-1024
> URL: https://issues.apache.org/jira/browse/HDFS-1024
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 0.22.0
>Reporter: dhruba borthakur
>Assignee: Dmytro Molkov
> Fix For: 0.22.0
>
> Attachments: HDFS-1024.patch, HDFS-1024.patch.1
>
>
> The secondary namenode fails to retrieve the entire fsimage from the 
> Namenode. It fetches a part of the fsimage but believes that it has fetched 
> the entire fsimage file and proceeds ahead with the checkpointing. Stack 
> traces will be attached below.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-918) Use single Selector and small thread pool to replace many instances of BlockSender for reads

2010-03-12 Thread Raghu Angadi (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844620#action_12844620
 ] 

Raghu Angadi commented on HDFS-918:
---

> RE: Netty, I'm not very knowledgeable about it beyond the Cliff's Notes 
> version, but my code dealing with the Selector is pretty small - the main 
> loop is under 75 lines, and java.util.concurrent does most of the heavy 
> lifting

Jay, I think it is OK to ignore Netty for this jira. It could be refactored later.

>> I think it is very important to have separate pools for each partition. 
> This would be the case if I were using a fixed-size thread pool and a 
> LinkedBlockingQueue - but I'm not, see Executors.newCachedThreadPool(),

Hmm... does it mean that if you have a thousand clients and the load is 
disk-bound, we end up with 1000 threads?


 

> Use single Selector and small thread pool to replace many instances of 
> BlockSender for reads
> 
>
> Key: HDFS-918
> URL: https://issues.apache.org/jira/browse/HDFS-918
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: data-node
>Reporter: Jay Booth
> Fix For: 0.22.0
>
> Attachments: hdfs-918-20100201.patch, hdfs-918-20100203.patch, 
> hdfs-918-20100211.patch, hdfs-918-20100228.patch, hdfs-918-20100309.patch, 
> hdfs-multiplex.patch
>
>
> Currently, on read requests, the DataXCeiver server allocates a new thread 
> per request, which must allocate its own buffers and leads to 
> higher-than-optimal CPU and memory usage by the sending threads.  If we had a 
> single selector and a small threadpool to multiplex request packets, we could 
> theoretically achieve higher performance while taking up fewer resources and 
> leaving more CPU on datanodes available for mapred, hbase or whatever.  This 
> can be done without changing any wire protocols.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-708) A stress-test tool for HDFS.

2010-03-12 Thread Konstantin Shvachko (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Shvachko updated HDFS-708:
-

Assignee: (was: Konstantin Shvachko)

> A stress-test tool for HDFS.
> 
>
> Key: HDFS-708
> URL: https://issues.apache.org/jira/browse/HDFS-708
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: test, tools
>Affects Versions: 0.22.0
>Reporter: Konstantin Shvachko
> Fix For: 0.22.0
>
> Attachments: SLiveTest.pdf
>
>
> It would be good to have a tool for automatic stress testing HDFS, which 
> would provide IO-intensive load on HDFS cluster.
> The idea is to start the tool, let it run overnight, and then be able to 
> analyze possible failures.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-918) Use single Selector and small thread pool to replace many instances of BlockSender for reads

2010-03-12 Thread Jay Booth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844658#action_12844658
 ] 

Jay Booth commented on HDFS-918:


>>> I think it is very important to have separate pools for each partition.
>> This would be the case if I were using a fixed-size thread pool and a 
>> LinkedBlockingQueue - but I'm not, see Executors.newCachedThreadPool(),

> Hmm... does it mean that if you have a thousand clients and the load is 
> disk-bound, we end up with 1000 threads?

Yeah, although it'll likely turn out to be fewer than 1000 in practice. If the 
requests are all short-lived, it could be significantly fewer than 1000 
threads once you consider re-use; if it's 1000 long reads, it'll probably wind 
up being only a little fewer, if at all. The threads themselves are really 
lightweight; the only resource attached to each one is a 
ThreadLocal<ByteBuffer>. (8k seemed OK for the ByteBuffer because the 
header+checksums portion is always significantly smaller than that, and the 
main block file transfers are done using transferTo.)
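A small sketch of that per-thread buffer arrangement (names are hypothetical, 
not from the patch):

    import java.nio.ByteBuffer;

    public class PacketBuffers {
        // One 8 KB buffer per worker thread, reused across requests. 8 KB
        // comfortably holds the packet header + checksums; the block data
        // itself goes out via FileChannel.transferTo() and never passes
        // through this buffer.
        private static final ThreadLocal<ByteBuffer> HEADER_BUF =
            new ThreadLocal<ByteBuffer>() {
                @Override protected ByteBuffer initialValue() {
                    return ByteBuffer.allocateDirect(8 * 1024);
                }
            };

        static ByteBuffer get() {
            ByteBuffer buf = HEADER_BUF.get();
            buf.clear(); // reset position/limit before reuse
            return buf;
        }
    }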

I chose this approach after initially experimenting with a fixed-size 
threadpool and a LinkedBlockingQueue because the handoff is faster and every 
pending IO request is guaranteed to become an actual disk-read syscall waiting 
on the operating system as fast as possible. This way, the operating system 
decides which disk request to fulfill first, taking advantage of the 
lower-level optimizations around disk IO. Since the threads are pretty 
lightweight and the lower-level calls do a better job of optimal fulfillment, 
I think this will work better than a fixed-size threadpool, where, for 
example, 2 adjacent reads from separate threads could be separated from each 
other in time even though the disk controller might have fulfilled both 
simultaneously and faster. This becomes even more important, I think, with the 
higher 512KB packet size -- those are big chunks of work per syscall that the 
underlying OS can optimize. Regarding the extra resource allocation for the 
threads: if we're disk-bound, then generally speaking a few extra memory 
resources shouldn't be a huge deal -- the gains from dispatching more disk 
requests in parallel should outweigh the memory allocation and context switch 
costs.

The above is all in theory -- I haven't benchmarked parallel implementations 
head-to-head. But certainly for random reads, and likely for longer reads, 
this approach should get the syscall invoked as fast as possible. Switching 
between the two models would be pretty simple: just change the parameters we 
pass to the ThreadPoolExecutor constructor (sketched below).
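A hedged sketch of the two ThreadPoolExecutor configurations being contrasted 
(parameter values are illustrative, not from the patch):

    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.SynchronousQueue;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    public class PoolModels {
        // Model 1: fixed-size pool + queue. Pending reads wait in the queue,
        // so at most 16 disk-read syscalls are in flight at any moment.
        static final ThreadPoolExecutor FIXED = new ThreadPoolExecutor(
            16, 16, 0L, TimeUnit.MILLISECONDS,
            new LinkedBlockingQueue<Runnable>());

        // Model 2: cached pool + direct handoff. Every pending read is
        // dispatched to a thread (and hence a syscall) immediately; the OS
        // and disk controller decide which request to fulfill first.
        static final ThreadPoolExecutor CACHED = new ThreadPoolExecutor(
            0, Integer.MAX_VALUE, 60L, TimeUnit.SECONDS,
            new SynchronousQueue<Runnable>());
    }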

> Use single Selector and small thread pool to replace many instances of 
> BlockSender for reads
> 
>
> Key: HDFS-918
> URL: https://issues.apache.org/jira/browse/HDFS-918
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: data-node
>Reporter: Jay Booth
> Fix For: 0.22.0
>
> Attachments: hdfs-918-20100201.patch, hdfs-918-20100203.patch, 
> hdfs-918-20100211.patch, hdfs-918-20100228.patch, hdfs-918-20100309.patch, 
> hdfs-multiplex.patch
>
>
> Currently, on read requests, the DataXCeiver server allocates a new thread 
> per request, which must allocate its own buffers and leads to 
> higher-than-optimal CPU and memory usage by the sending threads.  If we had a 
> single selector and a small threadpool to multiplex request packets, we could 
> theoretically achieve higher performance while taking up fewer resources and 
> leaving more CPU on datanodes available for mapred, hbase or whatever.  This 
> can be done without changing any wire protocols.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-854) Datanode should scan devices in parallel to generate block report

2010-03-12 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844657#action_12844657
 ] 

Hadoop QA commented on HDFS-854:


-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12438481/HDFS-854.patch
  against trunk revision 921697.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed core unit tests.

-1 contrib tests.  The patch failed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/269/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/269/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: 
http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/269/artifact/trunk/build/test/checkstyle-errors.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/269/console

This message is automatically generated.

> Datanode should scan devices in parallel to generate block report
> -
>
> Key: HDFS-854
> URL: https://issues.apache.org/jira/browse/HDFS-854
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: data-node
>Affects Versions: 0.22.0
>Reporter: dhruba borthakur
>Assignee: Dmytro Molkov
> Fix For: 0.22.0
>
> Attachments: HDFS-854.patch
>
>
> A Datanode should scan its disk devices in parallel so that the time to 
> generate a block report is reduced. This will reduce the startup time of a 
> cluster.
> A datanode has 12 disks (1 TB each) to store HDFS blocks. There is a total 
> of 150K blocks on these 12 disks. It takes the datanode up to 20 minutes to 
> scan these devices to generate the first block report.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-985) HDFS should issue multiple RPCs for listing a large directory

2010-03-12 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844758#action_12844758
 ] 

Suresh Srinivas commented on HDFS-985:
--

Comments on the trunk version of the patch:
# I feel throwing an exception instead of returning the accumulated list is 
better behavior. This will keep applications from using the partial list to 
query further and then having to handle a file-not-found exception (a sketch 
of the caller-side difference follows below). If we continue to return a 
partial list, add comments to the relevant methods in FileSystem noting that 
if a directory is deleted, the accumulated list will be returned.
# Add test cases for deletion of a directory while listStatus is still 
iterating.
# There are some mapred changes in the 20 version of the file that need to be 
made in the mapred branch?
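A hedged sketch of the caller-side difference discussed in point 1 (the 
listing loop and names are hypothetical, not patch code):

    import java.io.FileNotFoundException;
    import java.io.IOException;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListingCaller {
        // With exception-on-deletion semantics, a caller iterating a large
        // directory sees a clean failure if the directory vanishes
        // mid-listing, instead of silently acting on a truncated result.
        static void listAll(FileSystem fs, Path dir) throws IOException {
            try {
                for (FileStatus stat : fs.listStatus(dir)) {
                    System.out.println(stat.getPath());
                }
            } catch (FileNotFoundException e) {
                // Directory deleted while listing: retry or abort here,
                // rather than treating a partial list as complete.
            }
        }
    }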


> HDFS should issue multiple RPCs for listing a large directory
> -
>
> Key: HDFS-985
> URL: https://issues.apache.org/jira/browse/HDFS-985
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Hairong Kuang
>Assignee: Hairong Kuang
> Fix For: 0.22.0
>
> Attachments: directoryBrowse_0.20yahoo.patch, 
> iterativeLS_trunk.patch, iterativeLS_trunk1.patch, iterativeLS_trunk2.patch, 
> iterativeLS_yahoo.patch, iterativeLS_yahoo1.patch, testFileStatus.patch
>
>
> Currently HDFS issues one RPC from the client to the NameNode for listing a 
> directory. However, some directories are so large that they contain thousands 
> or millions of items. Listing such a large directory in one RPC has a few 
> shortcomings:
> 1. The list operation holds the global fsnamesystem lock for a long time, 
> thus blocking other requests. If a large number (like thousands) of such list 
> requests hit the NameNode in a short period of time, the NameNode will be 
> significantly slowed down. Users end up noticing longer response times or 
> lost connections to the NameNode.
> 2. The response message is uncontrollably big. We observed a response as big 
> as 50M bytes when listing a directory of 300 thousand items. Even with the 
> optimization introduced in HDFS-946, which may be able to cut the response by 
> 20-50%, the response size will still be on the order of 10 megabytes.
> I propose to implement directory listing using multiple RPCs. Here is the 
> plan:
> 1. Each getListing RPC has an upper limit on the number of items returned. 
> This limit could be configurable, but I am thinking of setting it to a fixed 
> number like 500.
> 2. Each RPC additionally specifies a start position for this listing request. 
> I am thinking of using the last item of the previous listing RPC as an 
> indicator. Since the NameNode stores all items in a directory as a sorted 
> array, it uses the last item to locate the start item of this listing even if 
> that item is deleted between the two consecutive calls. This has the 
> advantage of avoiding duplicate entries on the client side.
> 3. The return value additionally specifies whether the whole directory has 
> been listed. If the client sees a false flag, it will continue to issue 
> another RPC (a sketch of this client loop follows below).
> This proposal changes the semantics of large directory listing in that 
> listing is no longer an atomic operation if the directory's contents are 
> changing while the listing operation is in progress.
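A minimal sketch of the proposed client loop (all types and method names here 
are hypothetical stand-ins for whatever the final RPC signature becomes):

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical shapes standing in for the real RPC types.
    interface PartialListing {
        List<String> entries();   // up to the per-RPC limit, sorted
        boolean hasMore();        // false once the directory is exhausted
    }

    interface ListingRpc {
        PartialListing getPartialListing(String dir, String startAfter);
    }

    class ListingClient {
        // Issue getListing RPCs until the server reports no more entries,
        // using the last returned name as the cursor for the next call.
        static List<String> listAll(ListingRpc nn, String dir) {
            List<String> all = new ArrayList<String>();
            String startAfter = "";   // empty cursor = start of directory
            PartialListing part;
            do {
                part = nn.getPartialListing(dir, startAfter);
                all.addAll(part.entries());
                if (!part.entries().isEmpty()) {
                    startAfter = part.entries().get(part.entries().size() - 1);
                }
            } while (part.hasMore());
            return all;
        }
    }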

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.