[ https://issues.apache.org/jira/browse/HDFS-13398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16441039#comment-16441039 ]
Mukul Kumar Singh edited comment on HDFS-13398 at 4/17/18 3:52 PM:
-------------------------------------------------------------------

Thanks for working on this, [~ajaysachdev] and [~jcwik]. Please find my comments below.

1) The current patch does not apply. Can you please rebase it over the latest trunk?
2) Can you please upload the patch with the filename "HDFS-13398.001.patch"? Please follow the patch naming guidelines: https://wiki.apache.org/hadoop/HowToContribute#Naming_your_patch
3) Can the config "fs.threads" be replaced with a command line argument? I feel that would help in controlling the parallelization for each command. We can certainly have a default value for the case where it is not specified.
4) It would be great if some unit tests could be added for the patch.


was (Author: msingh):
Thanks for working on this, [~ajaysachdev]. Please find my comments below.

1) The current patch does not apply. Can you please rebase it over the latest trunk?
2) Can you please upload the patch with the filename "HDFS-13398.001.patch"? Please follow the patch naming guidelines: https://wiki.apache.org/hadoop/HowToContribute#Naming_your_patch
3) Can the config "fs.threads" be replaced with a command line argument? I feel that would help in controlling the parallelization for each command. We can certainly have a default value for the case where it is not specified.
4) It would be great if some unit tests could be added for the patch.


> Hdfs recursive listing operation is very slow
> ---------------------------------------------
>
>                 Key: HDFS-13398
>                 URL: https://issues.apache.org/jira/browse/HDFS-13398
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs
>    Affects Versions: 2.7.1
>         Environment: HCFS file system where HDP 2.6.1 is connected to ECS
> (Object Store).
>            Reporter: Ajay Sachdev
>            Assignee: Ajay Sachdev
>            Priority: Major
>             Fix For: 2.7.1
>
>         Attachments: parallelfsPatch
>
>
> The hdfs dfs -ls -R command is sequential in nature and is very slow for an
> HCFS system. We have seen it take around 6 mins for a structure of 40K
> directories and files.
> The proposal is to use a multithreaded approach to speed up the recursive
> list, du and count operations.
> We have tried a ForkJoinPool implementation to improve the performance of
> the recursive listing operation.
> [https://github.com/jasoncwik/hadoop-release/tree/parallel-fs-cli]
> commit id: 82387c8cd76c2e2761bd7f651122f83d45ae8876
> Another implementation uses the Java ExecutorService to run the listing
> operation in multiple threads in parallel. This has significantly reduced
> the time from 6 mins to 40 secs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
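
For readers unfamiliar with the ForkJoinPool approach mentioned in the issue description, here is a minimal sketch of a parallel recursive listing. It is not the code from the attached patch or the linked branch. To stay runnable without a cluster it uses java.nio.file instead of the Hadoop FileSystem API, and the names ParallelLister, ListTask and listRecursive are hypothetical. The threads parameter stands in for the per-command parallelism value discussed in comment 3.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

/** Hypothetical sketch: lists a directory tree with one fork-join task per directory. */
public class ParallelLister {

    /** Lists a single directory and forks a subtask for each subdirectory found. */
    private static final class ListTask extends RecursiveTask<List<Path>> {
        private final Path dir;

        ListTask(Path dir) {
            this.dir = dir;
        }

        @Override
        protected List<Path> compute() {
            List<Path> entries = new ArrayList<>();
            List<ListTask> subtasks = new ArrayList<>();
            try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
                for (Path entry : stream) {
                    entries.add(entry);
                    // Do not follow symlinks, to avoid cycles in the tree walk.
                    if (Files.isDirectory(entry, LinkOption.NOFOLLOW_LINKS)) {
                        ListTask task = new ListTask(entry);
                        task.fork(); // list the subdirectory on another worker
                        subtasks.add(task);
                    }
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            // Collect the results of all forked subdirectory listings.
            for (ListTask task : subtasks) {
                entries.addAll(task.join());
            }
            return entries;
        }
    }

    /**
     * Returns every file and directory below root (root itself excluded).
     * The order of the result depends on the file system and is not sorted.
     */
    public static List<Path> listRecursive(Path root, int threads) {
        ForkJoinPool pool = new ForkJoinPool(threads);
        try {
            return pool.invoke(new ListTask(root));
        } finally {
            pool.shutdown();
        }
    }
}
```

A call such as ParallelLister.listRecursive(Paths.get("/some/dir"), 8) would list the tree with up to eight worker threads. Directory listing is blocking I/O, which fork-join pools are not primarily designed for, so how much this helps depends on the latency of the underlying store.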