[jira] [Commented] (HADOOP-15471) Hdfs recursive listing operation is very slow
[ https://issues.apache.org/jira/browse/HADOOP-15471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16496792#comment-16496792 ]

Ajay Sachdev commented on HADOOP-15471:
---------------------------------------

Hello Mukul/Yiqun/Rishabh,

Please let me know if you have any comments on the latest patch. Appreciate your help!

Ajay

> Hdfs recursive listing operation is very slow
> ---------------------------------------------
>
>                 Key: HADOOP-15471
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15471
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs
>    Affects Versions: 2.7.1
>         Environment: HCFS file system where HDP 2.6.1 is connected to ECS
> (Object Store).
>            Reporter: Ajay Sachdev
>            Assignee: Ajay Sachdev
>            Priority: Major
>             Fix For: 2.7.1
>
>         Attachments: HDFS-13398.001.patch, HDFS-13398.002.patch,
> HDFS-13398.003.patch, parallelfsPatch
>
>
> The hdfs dfs -ls -R command is sequential in nature and is very slow for an
> HCFS system. We have seen around 6 mins for a structure of 40K directories
> and files.
> The proposal is to use a multithreaded approach to speed up the recursive
> list, du and count operations.
> We have tried a ForkJoinPool implementation to improve performance for the
> recursive listing operation.
> [https://github.com/jasoncwik/hadoop-release/tree/parallel-fs-cli]
> commit id : 82387c8cd76c2e2761bd7f651122f83d45ae8876
> Another implementation uses the Java Executor Service to run the listing
> operation in multiple threads in parallel. This has significantly reduced
> the time to 40 secs from 6 mins.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org
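The Executor Service approach mentioned in the issue description could look roughly like the sketch below. This is not the attached patch: it is a minimal illustration that assumes one task per directory, and it uses java.nio.file on a local file system as a stand-in for the Hadoop FileSystem API so that it runs on its own. The class name ParallelLister, the method listRecursive and the threads parameter are hypothetical names chosen for this example.

```java
import java.io.IOException;
import java.nio.file.DirectoryIteratorException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: recursive listing with one executor task per directory.
final class ParallelLister {
    private final ExecutorService pool;
    private final Queue<Path> results = new ConcurrentLinkedQueue<>();
    // Number of directory tasks submitted but not yet finished.
    private final AtomicInteger pending = new AtomicInteger();
    private final CountDownLatch done = new CountDownLatch(1);
    private final AtomicReference<IOException> failure = new AtomicReference<>();

    private ParallelLister(int threads) {
        pool = Executors.newFixedThreadPool(threads);
    }

    /** Returns every file and directory below root (root itself excluded). */
    static List<Path> listRecursive(Path root, int threads) throws IOException {
        ParallelLister lister = new ParallelLister(threads);
        try {
            lister.submit(root);
            lister.done.await(); // released when the last directory task ends
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IOException("listing interrupted", e);
        } finally {
            lister.pool.shutdownNow();
        }
        IOException error = lister.failure.get();
        if (error != null) {
            throw error;
        }
        return new ArrayList<>(lister.results);
    }

    private void submit(Path dir) {
        // Incremented before the parent task decrements, so the counter
        // cannot reach zero while child directories are still outstanding.
        pending.incrementAndGet();
        pool.execute(() -> {
            try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
                for (Path entry : entries) {
                    results.add(entry);
                    if (Files.isDirectory(entry)) {
                        submit(entry);
                    }
                }
            } catch (IOException e) {
                failure.compareAndSet(null, e);
            } catch (DirectoryIteratorException e) {
                failure.compareAndSet(null, e.getCause());
            } finally {
                if (pending.decrementAndGet() == 0) {
                    done.countDown();
                }
            }
        });
    }
}
```

A call such as ParallelLister.listRecursive(root, 8) returns the entries in no particular order, so a command that must print sorted output would still need to sort the result. The gain comes from overlapping many slow per-directory listing calls, which is the situation described for the object store backend.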
[jira] [Updated] (HADOOP-15471) Hdfs recursive listing operation is very slow
[ https://issues.apache.org/jira/browse/HADOOP-15471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ajay Sachdev updated HADOOP-15471:
----------------------------------
    Attachment: HDFS-13398.003.patch
[jira] [Commented] (HADOOP-15471) Hdfs recursive listing operation is very slow
[ https://issues.apache.org/jira/browse/HADOOP-15471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16491040#comment-16491040 ]

Ajay Sachdev commented on HADOOP-15471:
---------------------------------------

Attaching the latest patch, which adds the multithreaded optimization for Count as well. I also added unit tests to the diff. Let me know if there are any more comments.

Thanks
Ajay

[^HDFS-13398.003.patch]
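For readers following the thread, a multithreaded count can be expressed naturally with the ForkJoinPool mentioned in the issue description. The sketch below is only an illustration of that idea, not the code in HDFS-13398.003.patch. It assumes a local java.nio.file tree in place of the Hadoop FileSystem API, and the names ParallelCount, Summary and CountTask are hypothetical. Counting the starting directory itself is a choice made for this example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Hypothetical sketch: directory/file/byte totals computed with fork/join.
final class ParallelCount {
    /** Immutable totals for one subtree. */
    static final class Summary {
        final long dirs;
        final long files;
        final long bytes;

        Summary(long dirs, long files, long bytes) {
            this.dirs = dirs;
            this.files = files;
            this.bytes = bytes;
        }

        Summary plus(Summary other) {
            return new Summary(dirs + other.dirs, files + other.files, bytes + other.bytes);
        }
    }

    /** Counts one directory and forks a subtask for each subdirectory. */
    static final class CountTask extends RecursiveTask<Summary> {
        private final Path dir;

        CountTask(Path dir) {
            this.dir = dir;
        }

        @Override
        protected Summary compute() {
            Summary total = new Summary(1, 0, 0); // this directory itself
            List<CountTask> children = new ArrayList<>();
            try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
                for (Path entry : entries) {
                    if (Files.isDirectory(entry)) {
                        CountTask child = new CountTask(entry);
                        child.fork(); // subtree is counted asynchronously
                        children.add(child);
                    } else {
                        total = total.plus(new Summary(0, 1, Files.size(entry)));
                    }
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            for (CountTask child : children) {
                total = total.plus(child.join());
            }
            return total;
        }
    }

    static Summary count(Path root, int parallelism) {
        ForkJoinPool pool = new ForkJoinPool(parallelism);
        try {
            return pool.invoke(new CountTask(root));
        } finally {
            pool.shutdown();
        }
    }
}
```

Because each task returns its own Summary and the parent merges the results after join(), no shared mutable counters are needed. Fork/join pools are designed for CPU-bound work, so with blocking listing calls the parallelism value would likely need tuning.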
[jira] [Commented] (HADOOP-15471) Hdfs recursive listing operation is very slow
[ https://issues.apache.org/jira/browse/HADOOP-15471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16479336#comment-16479336 ]

Ajay Sachdev commented on HADOOP-15471:
---------------------------------------

I have incorporated all the items above. I am in the process of adding unit tests as well and testing these changes out. For the number-of-threads option I have opted for -T as the command-line argument, because the "du" command already has a lowercase -t option for storage type.
[jira] [Moved] (HADOOP-15471) Hdfs recursive listing operation is very slow
[ https://issues.apache.org/jira/browse/HADOOP-15471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ajay Sachdev moved HDFS-13398 to HADOOP-15471:
----------------------------------------------
        Fix Version/s: (was: 2.7.1) 2.7.1
    Affects Version/s: (was: 2.7.1) 2.7.1
     Target Version/s: 2.7.6 (was: 2.7.1)
          Component/s: (was: hdfs) fs
                  Key: HADOOP-15471 (was: HDFS-13398)
              Project: Hadoop Common (was: Hadoop HDFS)