[ https://issues.apache.org/jira/browse/HDFS-985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hairong Kuang updated HDFS-985:
-------------------------------

    Status: Patch Available  (was: Open)

> HDFS should issue multiple RPCs for listing a large directory
> -------------------------------------------------------------
>
>                 Key: HDFS-985
>                 URL: https://issues.apache.org/jira/browse/HDFS-985
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>             Fix For: 0.22.0
>
>         Attachments: directoryBrowse_0.20yahoo.patch,
> directoryBrowse_0.20yahoo_1.patch, iterativeLS_trunk.patch,
> iterativeLS_trunk1.patch, iterativeLS_trunk2.patch, iterativeLS_trunk3.patch,
> iterativeLS_yahoo.patch, iterativeLS_yahoo1.patch, testFileStatus.patch
>
>
> Currently HDFS issues one RPC from the client to the NameNode for listing a
> directory. However, some directories are large, containing thousands or
> millions of items. Listing such large directories in one RPC has a few
> shortcomings:
> 1. The list operation holds the global fsnamesystem lock for a long time, thus
> blocking other requests. If a large number (like thousands) of such list
> requests hit the NameNode in a short period of time, the NameNode will be
> significantly slowed down. Users end up noticing longer response times or lost
> connections to the NameNode.
> 2. The response message is uncontrollably big. We observed a response as big
> as 50M bytes when listing a directory of 300 thousand items. Even with the
> optimization introduced in HDFS-946, which may be able to cut the response by
> 20-50%, the response size will still be on the order of 10 megabytes.
> I propose to implement directory listing using multiple RPCs. Here is the
> plan:
> 1. Each getListing RPC has an upper limit on the number of items returned.
> This limit could be configurable, but I am thinking of setting it to a fixed
> number like 500.
> 2. Each RPC additionally specifies a start position for this listing request.
> I am thinking of using the last item of the previous listing RPC as an
> indicator. Since the NameNode stores all items in a directory as a sorted
> array, the NameNode uses the last item to locate the start item of this
> listing even if the last item is deleted in between these two consecutive
> calls. This has the advantage of avoiding duplicate entries at the client side.
> 3. The return value additionally specifies whether the whole directory has
> been listed. If the client sees a false flag, it will continue to issue
> another RPC.
> This proposal will change the semantics of large directory listing in the
> sense that listing is no longer an atomic operation if a directory's content
> is changing while the listing operation is in progress.
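
The three-step plan above can be sketched as a small in-memory model. This is only an illustration under stated assumptions, not the code in the attached patches: the function names `get_listing` and `list_directory`, the `(batch, done)` return shape, and the use of an empty string as the initial start position are all hypothetical, and a plain sorted Python list stands in for the NameNode's sorted array of directory items.

```python
from bisect import bisect_right
from typing import List, Tuple

LISTING_LIMIT = 500  # fixed per-RPC cap suggested in the proposal


def get_listing(children: List[str], start_after: str,
                limit: int = LISTING_LIMIT) -> Tuple[List[str], bool]:
    """Hypothetical server side: one bounded listing RPC (limit must be > 0).

    `children` stands in for the NameNode's sorted array of directory items.
    Returns (batch, done); done is True once the directory is exhausted.
    """
    # First index whose name sorts strictly after start_after. The search
    # still works if start_after was deleted since the previous call.
    start = bisect_right(children, start_after)
    batch = children[start:start + limit]
    done = start + len(batch) >= len(children)
    return batch, done


def list_directory(children: List[str],
                   limit: int = LISTING_LIMIT) -> List[str]:
    """Hypothetical client side: repeat the RPC until the done flag is set."""
    result: List[str] = []
    start_after = ""  # assumption: the empty name sorts before every real name
    while True:
        batch, done = get_listing(children, start_after, limit)
        result.extend(batch)
        if done:
            return result
        start_after = batch[-1]  # last item of the previous RPC
```

Because each call resumes strictly after the last name returned, an entry that is deleted between two calls neither breaks the lookup nor causes duplicates on the client. Entries created or removed elsewhere in the directory while the loop runs may or may not appear in the result, which is the loss of atomicity the proposal describes.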