[ 
https://issues.apache.org/jira/browse/HDFS-13616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17642970#comment-17642970
 ] 

ASF GitHub Bot commented on HDFS-13616:
---------------------------------------

fanlinqian commented on PR #1725:
URL: https://github.com/apache/hadoop/pull/1725#issuecomment-1336370086

   Hello, I encountered a bug when using the batched listing method. When I
list a directory containing more than 1000 files, where each file's data
block has 2 replications, only the first 500 files of the directory are
returned and then the iteration stops. I think the fix belongs in the
getBatchedListing() method of
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java,
as follows:
       for (; srcsIndex < srcs.length; srcsIndex++) {
           String src = srcs[srcsIndex];
           HdfsPartialListing listing;
           try {
               DirectoryListing dirListing =
                   getListingInt(dir, pc, src, indexStartAfter, needLocation);
               if (dirListing == null) {
                   throw new FileNotFoundException(
                       "Path " + src + " does not exist");
               }
               listing = new HdfsPartialListing(srcsIndex,
                   Lists.newArrayList(dirListing.getPartialListing()));
               numEntries += listing.getPartialListing().size();
               lastListing = dirListing;
           } catch (Exception e) {
               if (e instanceof AccessControlException) {
                   logAuditEvent(false, operationName, src);
               }
               listing = new HdfsPartialListing(srcsIndex,
                   new RemoteException(e.getClass().getCanonicalName(),
                       e.getMessage()));
               lastListing = null;
               LOG.info("Exception listing src {}", src, e);
           }
           listings.put(srcsIndex, listing);

           // My modification: if the current directory's listing was
           // truncated, stop here so the remaining entries are returned in
           // the next batch. (lastListing is null when this src failed in
           // the catch block above, hence the null check.)
           if (lastListing != null && lastListing.getRemainingEntries() != 0) {
               break;
           }

           if (indexStartAfter.length != 0) {
               indexStartAfter = new byte[0];
           }
           // Terminate if we've reached the maximum listing size
           if (numEntries >= dir.getListLimit()) {
               break;
           }
       }
   The root cause is that getListingInt(dir, pc, src, indexStartAfter,
needLocation) limits its result not only by the number of files in the
directory but also by the number of data blocks and replications those files
carry, so a single call can return far fewer entries than the file-count
limit. getBatchedListing(), however, only exits the loop once the number of
returned entries reaches the 1000-entry limit, so a listing truncated by the
block/replication budget is treated as complete and the remaining files are
never returned.
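   To illustrate the arithmetic, here is a rough sketch. It assumes a
per-call limit of 1000 units and a cost of 2 units per file (one per
replication of its single block), which matches the numbers above; the
exact accounting inside getListingInt() may differ.

       // Hypothetical simulation of the two limits described above; the
       // budget model is an assumption for illustration, not the exact
       // HDFS accounting.
       public class BatchedListingLimitSketch {
           public static void main(String[] args) {
               final int lsLimit = 1000;   // assumed per-call listing limit
               final int costPerFile = 2;  // assumed: 1 unit per replica

               // One getListingInt() call: the location-aware budget runs
               // out after lsLimit / costPerFile = 500 files, not 1000.
               int filesInOneCall = lsLimit / costPerFile;
               System.out.println("files in one call: " + filesInOneCall);

               // getBatchedListing()'s only termination check compares the
               // entry count against the same limit, so it never fires:
               int numEntries = filesInOneCall;
               System.out.println("numEntries >= lsLimit? "
                   + (numEntries >= lsLimit));
               // false -> the loop advances without recording the remaining
               // entries, which is why the listing stops after 500 files.
           }
       }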
   Looking forward to your reply.




> Batch listing of multiple directories
> -------------------------------------
>
>                 Key: HDFS-13616
>                 URL: https://issues.apache.org/jira/browse/HDFS-13616
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>    Affects Versions: 3.2.0
>            Reporter: Andrew Wang
>            Assignee: Chao Sun
>            Priority: Major
>             Fix For: 3.3.0
>
>         Attachments: BenchmarkListFiles.java, HDFS-13616.001.patch, 
> HDFS-13616.002.patch
>
>
> One of the dominant workloads for external metadata services is listing of 
> partition directories. This can end up being bottlenecked on RTT time when 
> partition directories contain a small number of files. This is fairly common, 
> since fine-grained partitioning is used for partition pruning by the query 
> engines.
> A batched listing API that takes multiple paths amortizes the RTT cost. 
> Initial benchmarks show a 10-20x improvement in metadata loading performance.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
