[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs
[ https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13647345#comment-13647345 ] Suresh Srinivas commented on HDFS-4489: --- I have merged HDFS-4434 back into branch-2. Use InodeID as as an identifier of a file in HDFS protocols and APIs Key: HDFS-4489 URL: https://issues.apache.org/jira/browse/HDFS-4489 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Brandon Li Assignee: Brandon Li Fix For: 2.0.5-beta Attachments: 4434.optimized.patch The benefit of using InodeID to uniquely identify a file can be multiple folds. Here are a few of them: 1. uniquely identify a file cross rename, related JIRAs include HDFS-4258, HDFS-4437. 2. modification checks in tools like distcp. Since a file could have been replaced or renamed to, the file name and size combination is no t reliable, but the combination of file id and size is unique. 3. id based protocol support (e.g., NFS) 4. to make the pluggable block placement policy use fileid instead of filename (HDFS-385). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs
[ https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13646745#comment-13646745 ] Suresh Srinivas commented on HDFS-4489: --- Thanks [~nroberts] and [~daryn] for commenting back. bq. 100K files via 100 threads seems like a very small sampling when we're running namespaces well over 100M. I think the only detail that might make performance worse is how well the inode map performs as the bucket chains get longer. If it's a problem we can probably fix it later. 100 threads is quite considerable and matches well with typical big cluster RPC handler count. Also inodeMap size is created as a percentage of total memory. That means it is sized based on the namenode size. I agree that this performance impact should be minimal and we should be able to fix if we find any issues. bq. I did notice that unprotectedConcat appears to leak inodes in the map - it unlinks the concat'ed files but doesn't remove them from the map. Nice catch. Created HDFS-4785. bq. ...so you may want to double check. Yes. I will run through one more review. bq. Might want to correct the misspelling: remvoed AllFromInodesFromMap Will be addressed in another jira. Use InodeID as as an identifier of a file in HDFS protocols and APIs Key: HDFS-4489 URL: https://issues.apache.org/jira/browse/HDFS-4489 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Brandon Li Assignee: Brandon Li Fix For: 2.0.5-beta Attachments: 4434.optimized.patch The benefit of using InodeID to uniquely identify a file can be multiple folds. Here are a few of them: 1. uniquely identify a file cross rename, related JIRAs include HDFS-4258, HDFS-4437. 2. modification checks in tools like distcp. Since a file could have been replaced or renamed to, the file name and size combination is no t reliable, but the combination of file id and size is unique. 3. id based protocol support (e.g., NFS) 4. to make the pluggable block placement policy use fileid instead of filename (HDFS-385). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs
[ https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13645624#comment-13645624 ] Suresh Srinivas commented on HDFS-4489: --- [~shv] [~nroberts] [~daryn], please let me know if your concerns are addressed. If I do not hear, I plan on committing this change by today or tomorrow. Use InodeID as as an identifier of a file in HDFS protocols and APIs Key: HDFS-4489 URL: https://issues.apache.org/jira/browse/HDFS-4489 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Brandon Li Assignee: Brandon Li Fix For: 2.0.5-beta Attachments: 4434.optimized.patch The benefit of using InodeID to uniquely identify a file can be multiple folds. Here are a few of them: 1. uniquely identify a file cross rename, related JIRAs include HDFS-4258, HDFS-4437. 2. modification checks in tools like distcp. Since a file could have been replaced or renamed to, the file name and size combination is no t reliable, but the combination of file id and size is unique. 3. id based protocol support (e.g., NFS) 4. to make the pluggable block placement policy use fileid instead of filename (HDFS-385). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs
[ https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13645635#comment-13645635 ] Nathan Roberts commented on HDFS-4489: -- Thanks for running some basic performance tests! Looks like minimal impact. Use InodeID as as an identifier of a file in HDFS protocols and APIs Key: HDFS-4489 URL: https://issues.apache.org/jira/browse/HDFS-4489 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Brandon Li Assignee: Brandon Li Fix For: 2.0.5-beta Attachments: 4434.optimized.patch The benefit of using InodeID to uniquely identify a file can be multiple folds. Here are a few of them: 1. uniquely identify a file cross rename, related JIRAs include HDFS-4258, HDFS-4437. 2. modification checks in tools like distcp. Since a file could have been replaced or renamed to, the file name and size combination is no t reliable, but the combination of file id and size is unique. 3. id based protocol support (e.g., NFS) 4. to make the pluggable block placement policy use fileid instead of filename (HDFS-385). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs
[ https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13645924#comment-13645924 ] Daryn Sharp commented on HDFS-4489: --- 100K files via 100 threads seems like a very small sampling when we're running namespaces well over 100M. I think the only detail that might make performance worse is how well the inode map performs as the bucket chains get longer. If it's a problem we can probably fix it later. I did notice that unprotectedConcat appears to leak inodes in the map - it unlinks the concat'ed files but doesn't remove them from the map. The business logic for keeping the inode map in sync with the namespace is high enough up the call stack that it makes it a bit tough to prove all delete paths are safe, so you may want to double check. Use InodeID as as an identifier of a file in HDFS protocols and APIs Key: HDFS-4489 URL: https://issues.apache.org/jira/browse/HDFS-4489 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Brandon Li Assignee: Brandon Li Fix For: 2.0.5-beta Attachments: 4434.optimized.patch The benefit of using InodeID to uniquely identify a file can be multiple folds. Here are a few of them: 1. uniquely identify a file cross rename, related JIRAs include HDFS-4258, HDFS-4437. 2. modification checks in tools like distcp. Since a file could have been replaced or renamed to, the file name and size combination is no t reliable, but the combination of file id and size is unique. 3. id based protocol support (e.g., NFS) 4. to make the pluggable block placement policy use fileid instead of filename (HDFS-385). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs
[ https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13645932#comment-13645932 ] Daryn Sharp commented on HDFS-4489: --- Might want to correct the misspelling: *remvoed* AllFromInodesFromMap :) Use InodeID as as an identifier of a file in HDFS protocols and APIs Key: HDFS-4489 URL: https://issues.apache.org/jira/browse/HDFS-4489 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Brandon Li Assignee: Brandon Li Fix For: 2.0.5-beta Attachments: 4434.optimized.patch The benefit of using InodeID to uniquely identify a file can be multiple folds. Here are a few of them: 1. uniquely identify a file cross rename, related JIRAs include HDFS-4258, HDFS-4437. 2. modification checks in tools like distcp. Since a file could have been replaced or renamed to, the file name and size combination is no t reliable, but the combination of file id and size is unique. 3. id based protocol support (e.g., NFS) 4. to make the pluggable block placement policy use fileid instead of filename (HDFS-385). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs
[ https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13644880#comment-13644880 ] Suresh Srinivas commented on HDFS-4489: --- Here is NNBench for delete operations (run with 100 threads simultaneously running: ||Opertaions||Elapsed||OpsPerSec||AvgTim|| |10|19243|5196.694902|19| |10|18598|5376.92225|18| |10|17819|5611.987205|17| |10|17953|5570.099705|17| |10|18077|5531.891354|18| |10|17948|5571.651437|17| |10|18080|5530.973451|18| |10|18032|5545.696539|18| |10|18431|5425.641582|18| |10|17735|5638.567804|17| |10|1819|.6 5500|012623 17.7| ||Opertaions||Elapsed||OpsPerSec||AvgTim|| |10|18029|5546.619336|17| |10|18527|5397.527932|18| |10|18164|5505.395287|18| |10|18486|5409.49908|18| |10|18053|5539.24|18| |10|18313|5460.601758|18| |10|18299|5464.779496|18| |10|17878|5593.466831|17| |10|18178|5501.155243|18| |10|18084|5529.750055|18| |10|1820|.1 5494|804057 17.8| Use InodeID as as an identifier of a file in HDFS protocols and APIs Key: HDFS-4489 URL: https://issues.apache.org/jira/browse/HDFS-4489 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Brandon Li Assignee: Brandon Li Fix For: 2.0.5-beta Attachments: 4434.optimized.patch The benefit of using InodeID to uniquely identify a file can be multiple folds. Here are a few of them: 1. uniquely identify a file cross rename, related JIRAs include HDFS-4258, HDFS-4437. 2. modification checks in tools like distcp. Since a file could have been replaced or renamed to, the file name and size combination is no t reliable, but the combination of file id and size is unique. 3. id based protocol support (e.g., NFS) 4. to make the pluggable block placement policy use fileid instead of filename (HDFS-385). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs
[ https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13644898#comment-13644898 ] Suresh Srinivas commented on HDFS-4489: --- Summary of results in the tests: # File dreate tests- perform additional reserved name processing, inode map addition and reserved name check. This is where maximum additional work from the patch is being done. #* In the mirco benchmark by just calling create file related methods, the time went from 19235.8 to 19789.2 roughly 2.8% different. This can be further reduced by turning off map to 1.3%. The patch moves splitting paths into components outside the lock. Based on this, further optimizations are possible that improves throughput by reducing the synchronized sections. The end result with that optimizations can make running times much smaller that what it is today. #* I would also point out that, this is a micro benchmark. The % difference observed in this will be dwarfed by RPC times, network round trip time etc. Also the system will spend time on other operations which should not be affected by this patch. # File delete tests - performs reseved name processing and only inode map deletion. #* There very little difference in bench mark results. Use InodeID as as an identifier of a file in HDFS protocols and APIs Key: HDFS-4489 URL: https://issues.apache.org/jira/browse/HDFS-4489 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Brandon Li Assignee: Brandon Li Fix For: 2.0.5-beta Attachments: 4434.optimized.patch The benefit of using InodeID to uniquely identify a file can be multiple folds. Here are a few of them: 1. uniquely identify a file cross rename, related JIRAs include HDFS-4258, HDFS-4437. 2. modification checks in tools like distcp. Since a file could have been replaced or renamed to, the file name and size combination is no t reliable, but the combination of file id and size is unique. 3. id based protocol support (e.g., NFS) 4. to make the pluggable block placement policy use fileid instead of filename (HDFS-385). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs
[ https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13644907#comment-13644907 ] Suresh Srinivas commented on HDFS-4489: --- Given the above tests, here are all the issues that are brought up: # Introducing incompatible change #* This is not a major incompatibility. As I said earlier, creating file or directory /.reserved is not allowed. That said, this should get into 2.0.5 given its main goal is compatibility. # This patch could be destabilizing #* This patch is adding an Inode map and support for path scheme which allows addressing files by inodes. Most of the code added in this patch is to support the new addressing mechanisms and extensive unit tests associated with it. The regular code path should largely be unaffected by this, with exception of adding and deleting entries in inode map. Please bring up any concerns that I might have overlooked. # Performance impact - based on the results, there is a very little performance impact. I have two options: #* The difference observed in microbenchmarks amounts to much smaller difference in a real system. That too only associated with a few write operations such as create. Hence is it acceptable. #* Make further optimizations to reduce synchronized section size based on the mechanism added in this patch. [~nroberts] if you feel this is important, I will undertake the work of optimizing this. [~daryn] also had expressed interest in it. Not sure if he has the bandwidth. Given this, I would like to merge this in branch-2.0.5. I hope concerns expressed by people are addressed. Use InodeID as as an identifier of a file in HDFS protocols and APIs Key: HDFS-4489 URL: https://issues.apache.org/jira/browse/HDFS-4489 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Brandon Li Assignee: Brandon Li Fix For: 2.0.5-beta Attachments: 4434.optimized.patch The benefit of using InodeID to uniquely identify a file can be multiple folds. Here are a few of them: 1. uniquely identify a file cross rename, related JIRAs include HDFS-4258, HDFS-4437. 2. modification checks in tools like distcp. Since a file could have been replaced or renamed to, the file name and size combination is no t reliable, but the combination of file id and size is unique. 3. id based protocol support (e.g., NFS) 4. to make the pluggable block placement policy use fileid instead of filename (HDFS-385). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs
[ https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13645090#comment-13645090 ] Tsz Wo (Nicholas), SZE commented on HDFS-4489: -- The performance numbers look good. Since the rpc time is not counted, a small percentage difference is nothing. Beside, the Inode ID feature is very useful. It also helps implementing the Snapshot feature. +1 on merging it to branch-2. Use InodeID as as an identifier of a file in HDFS protocols and APIs Key: HDFS-4489 URL: https://issues.apache.org/jira/browse/HDFS-4489 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Brandon Li Assignee: Brandon Li Fix For: 2.0.5-beta Attachments: 4434.optimized.patch The benefit of using InodeID to uniquely identify a file can be multiple folds. Here are a few of them: 1. uniquely identify a file cross rename, related JIRAs include HDFS-4258, HDFS-4437. 2. modification checks in tools like distcp. Since a file could have been replaced or renamed to, the file name and size combination is no t reliable, but the combination of file id and size is unique. 3. id based protocol support (e.g., NFS) 4. to make the pluggable block placement policy use fileid instead of filename (HDFS-385). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs
[ https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13643745#comment-13643745 ] Suresh Srinivas commented on HDFS-4489: --- I ran Slive tests. Even with very small size data written, I could not find perceptible difference between the test runs given any additional time in NN methods is dwarfed by the overall time of calling NN over RPC etc. So I decided to run NNThroughputBenchmark. For folks new to it, it is a micro benchmark that does not use RPC and directly executes operations on the namenode class. Hence it gives comparisons sharply limited to NN method calls alone. I ran NNThroughputBenchmark command run to create 100K files using 100 threads in each iteration, using the command below: {noformat} bin/hadoop jar share/hadoop/hdfs/hadoop-hdfs-2.0.5-SNAPSHOT-tests.jar org.apache.hadoop.hdfs.server.namenode.NNThroughputBenchmark -op create -threads 100 -files 10 -filesPerDir 100 {noformat} *Without this patch:* ||Opertaions||Elapsed||OpsPerSec||AvgTime|| |10| 20327| 4919.565110444237| 20| |10| 19199| 5208.604614823688| 19| |10| 19287| 5184.839529216571| 19| |10| 19128| 5227.9381012128815| 19| |10| 19082| 5240.540823813018| 19| |10| 18785| 5323.396326856535| 18| |10| 18947| 5277.880403230063| 18| |10| 18963| 5273.427200337499| 18| |10| 19206| 5206.706237634073| 19| |10| 19434| 5145.621076463929| 19| |Average|19235.8|5200.851942|18.8| *With this patch:* ||Opertaions||Elapsed||OpsPerSec||AvgTime|| |10| 20104| 4974.134500596896| 19| |10| 19498| 5128.731151913017| 19| |10| 19449| 5141.652527122217| 19| |10| 19530| 5120.327700972863| 19| |10| 20067| 4983.305925150745| 19| |10| 19703| 5075.369233111709| 19| |10| 19595| 5103.342689461598| 19| |10| 19418| 5149.860953754249| 19| |10| 19932| 5017.057997190447| 19| |10| 20596| 4855.311711011847| 20| |Average|19789.2|5054.909439|19.1| *With this patch + an additional change to turn off INodeMap:* ||Opertaions||Elapsed||OpsPerSec||AvgTime|| |10| 19615| 5098.139179199592| 19| |10| 19349| 5168.225748100677| 19| |10| 19136| 5225.752508361204| 19| |10| 19347| 5168.760014472528| 19| |10| 20096| 4976.114649681529| 19| |10| 19248| 5195.344970906068| 19| |10| 18916| 5286.529921759357| 18| |10| 19217| 5203.7258677212885| 19| |10| 20105| 4973.887092762994| 20| |10| 19882| 5029.675082989639| 19| |Average|19491.1|5132.615504|19| Use InodeID as as an identifier of a file in HDFS protocols and APIs Key: HDFS-4489 URL: https://issues.apache.org/jira/browse/HDFS-4489 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Brandon Li Assignee: Brandon Li Fix For: 2.0.5-beta The benefit of using InodeID to uniquely identify a file can be multiple folds. Here are a few of them: 1. uniquely identify a file cross rename, related JIRAs include HDFS-4258, HDFS-4437. 2. modification checks in tools like distcp. Since a file could have been replaced or renamed to, the file name and size combination is no t reliable, but the combination of file id and size is unique. 3. id based protocol support (e.g., NFS) 4. to make the pluggable block placement policy use fileid instead of filename (HDFS-385). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs
[ https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13643839#comment-13643839 ] Suresh Srinivas commented on HDFS-4489: --- I made changes to the code to reuse the byte[][] pathComponents for file creation (made some optimizations in that method. There are other optimizations available in terms of permission checks that I did not venture to do). The throughput with those partial optimizations is: ||Opertaions||Elapsed||OpsPerSec||AvgTime|| |10| 19591| 5104.384666428462| 19| |10| 18969| 5271.759186040382| 18| |10| 19206| 5206.706237634073| 19| |10| 18652| 5361.35535063264| 18| |10| 19218| 5203.455094182537| 19| |10| 19179| 5214.036185411127| 19| |10| 19302| 5180.810278727593| 19| |10| 19388| 5157.829585310501| 19| |10| 19099| 5235.876223886067| 19| |10| 19591| 5104.384666428462| 19| |Average|19219.5|5204.059747|18.8| Use InodeID as as an identifier of a file in HDFS protocols and APIs Key: HDFS-4489 URL: https://issues.apache.org/jira/browse/HDFS-4489 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Brandon Li Assignee: Brandon Li Fix For: 2.0.5-beta The benefit of using InodeID to uniquely identify a file can be multiple folds. Here are a few of them: 1. uniquely identify a file cross rename, related JIRAs include HDFS-4258, HDFS-4437. 2. modification checks in tools like distcp. Since a file could have been replaced or renamed to, the file name and size combination is no t reliable, but the combination of file id and size is unique. 3. id based protocol support (e.g., NFS) 4. to make the pluggable block placement policy use fileid instead of filename (HDFS-385). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs
[ https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13642887#comment-13642887 ] Daryn Sharp commented on HDFS-4489: --- I don't think Nathan and I are questioning the utility of the feature, but need to get a feel for the possible performance impact. _If_ there is a significant degradation then it will delay our adoption of 2.x until it's optimized. I think a good performance test is to create a namespace of 150M paths. Flood the NN with thousands of concurrent file directory add/deletes per second throughout the namespace. Hopefully there is existing benchmark with those properties. Use InodeID as as an identifier of a file in HDFS protocols and APIs Key: HDFS-4489 URL: https://issues.apache.org/jira/browse/HDFS-4489 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Brandon Li Assignee: Brandon Li Fix For: 2.0.5-beta The benefit of using InodeID to uniquely identify a file can be multiple folds. Here are a few of them: 1. uniquely identify a file cross rename, related JIRAs include HDFS-4258, HDFS-4437. 2. modification checks in tools like distcp. Since a file could have been replaced or renamed to, the file name and size combination is no t reliable, but the combination of file id and size is unique. 3. id based protocol support (e.g., NFS) 4. to make the pluggable block placement policy use fileid instead of filename (HDFS-385). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs
[ https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13642895#comment-13642895 ] Suresh Srinivas commented on HDFS-4489: --- bq. I think a good performance test is to create a namespace of 150M paths. Flood the NN with thousands of concurrent file directory add/deletes per second throughout the namespace. Hopefully there is existing benchmark with those properties. I think we are talking about hashmap entry addition and deletion during adds and delete of files, other than increased memory. I am not sure I understand the cache pollution part of performance impact, given namenode core objects run into GBs in a large setup. I am currently running some slive tests. But I do not currently have bandwidth to setup a namenode with 150M paths (that would require more than 64GB of JVM heap). Do you have some bandwidth to do these tests? Use InodeID as as an identifier of a file in HDFS protocols and APIs Key: HDFS-4489 URL: https://issues.apache.org/jira/browse/HDFS-4489 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Brandon Li Assignee: Brandon Li Fix For: 2.0.5-beta The benefit of using InodeID to uniquely identify a file can be multiple folds. Here are a few of them: 1. uniquely identify a file cross rename, related JIRAs include HDFS-4258, HDFS-4437. 2. modification checks in tools like distcp. Since a file could have been replaced or renamed to, the file name and size combination is no t reliable, but the combination of file id and size is unique. 3. id based protocol support (e.g., NFS) 4. to make the pluggable block placement policy use fileid instead of filename (HDFS-385). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs
[ https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13642907#comment-13642907 ] Konstantin Shvachko commented on HDFS-4489: --- Suresh, 0.20 is not typo. You should parse it as a sarcasm, sorry. Wire compatibility was a target for many previous releases and the train is still there. We clearly have a disagreement about what should be in the release. Other people may have other opinions. And that is my point. All I ask is to play by the rules. Make a release plan and put it into vote. See bylaws under Release Plan. I'll be glad to discuss your plan. Here you act like its your own branch where you commit what you want and nobody else cares. Does it make sense? Use InodeID as as an identifier of a file in HDFS protocols and APIs Key: HDFS-4489 URL: https://issues.apache.org/jira/browse/HDFS-4489 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Brandon Li Assignee: Brandon Li Fix For: 2.0.5-beta The benefit of using InodeID to uniquely identify a file can be multiple folds. Here are a few of them: 1. uniquely identify a file cross rename, related JIRAs include HDFS-4258, HDFS-4437. 2. modification checks in tools like distcp. Since a file could have been replaced or renamed to, the file name and size combination is no t reliable, but the combination of file id and size is unique. 3. id based protocol support (e.g., NFS) 4. to make the pluggable block placement policy use fileid instead of filename (HDFS-385). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs
[ https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13642996#comment-13642996 ] Suresh Srinivas commented on HDFS-4489: --- bq. Here you act like its your own branch where you commit what you want and nobody else cares. I fail to understand the need for such hostile tone. That said, please look at many small features, improvements and numerous bug fixes that are committed by me and other committers. Also instead of stating your objection to a change as it is big, 150K lines of code etc., it would be great if you can really look at the patch and express more concrete technical concerns related to stability. I have reverted HDFS-4434. I have also responded on the thread related to 2.0.5 on including the features that many have been working for many months. It seems to me that suddenly in past week or so you have decided that stability is the only paramount thing, disregarding all the discussions that have happened. Please see my earlier comment on discussion related to API and wire protocol stability that we sent months ago. Use InodeID as as an identifier of a file in HDFS protocols and APIs Key: HDFS-4489 URL: https://issues.apache.org/jira/browse/HDFS-4489 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Brandon Li Assignee: Brandon Li Fix For: 2.0.5-beta The benefit of using InodeID to uniquely identify a file can be multiple folds. Here are a few of them: 1. uniquely identify a file cross rename, related JIRAs include HDFS-4258, HDFS-4437. 2. modification checks in tools like distcp. Since a file could have been replaced or renamed to, the file name and size combination is no t reliable, but the combination of file id and size is unique. 3. id based protocol support (e.g., NFS) 4. to make the pluggable block placement policy use fileid instead of filename (HDFS-385). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs
[ https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13643025#comment-13643025 ] Nathan Roberts commented on HDFS-4489: -- bq. Suresh is willing to do the performance benchmark, but I am trying to understand where you are coming from. Yahoo and FB create very large namespaces by simply buying more memory and increasing the size of the heap. This is not always possible. Some of our namenodes are running at the maximum configuration for the box (maximum memory, maximum heap, near maximum namespace). For these clusters, upgrading to this feature will require new boxes. bq. Do you worry about cache pollution when you create 50K more files? I don't worry about cache pollution when I create 50K more files. What's important is the size of the working set. Inodes are a very popular object within the NN, if inodes make up a significant part of our working set, then it matters. I don't know whether this is the case or not, that's why I think it makes sense to run some benchmarks to make sure we don't see any ill-effects. With the introduction of YARN, the central RM is rarely the bottleneck. Now it's much more common for the NN to be the bottleneck of the cluster, and slowing down the bottleneck always needs to be looked at carefully. bq. Given that the NN heap (many GBs) is so much larger than the cache, does the additional inode and inode-map size impact the overall system performance? Good question. Let's find out. bq. Suresh has argued that a 24GB heap grows by 625MB. I was using the numbers Todd gathered where a 7G heap grew by 600MB. When we looked at one of our key clusters, we calculated something like 7.5% increase. bq. Looking at the growth in memory of this feature as a percentage of the total heap size is a more realistic way of looking at the impact of the growth than the growth of an individual data structure like the inode. Maybe. bq. IMHO, not having an inode-map and inode number was a serious limitation in the original implementation of NN. I am willing to pay for the extra memory given the value inode-id and inode-map brings (as described by suresh in the beginning of this Jira). Permissions, access time, etc added to the memory cost of the the NN and were accepted because of the value they bring. Certainly agree it is a limitation. We just need to make sure we fully quantify all of the costs. Use InodeID as as an identifier of a file in HDFS protocols and APIs Key: HDFS-4489 URL: https://issues.apache.org/jira/browse/HDFS-4489 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Brandon Li Assignee: Brandon Li Fix For: 2.0.5-beta The benefit of using InodeID to uniquely identify a file can be multiple folds. Here are a few of them: 1. uniquely identify a file cross rename, related JIRAs include HDFS-4258, HDFS-4437. 2. modification checks in tools like distcp. Since a file could have been replaced or renamed to, the file name and size combination is no t reliable, but the combination of file id and size is unique. 3. id based protocol support (e.g., NFS) 4. to make the pluggable block placement policy use fileid instead of filename (HDFS-385). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs
[ https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13643404#comment-13643404 ] Konstantin Shvachko commented on HDFS-4489: --- hostile tone. I apologize. I guess what I really wanted to say that it is hostile to commit incompatible changes in a stabilization branch before the release plan is proposed. would be great if you can really look at the patch You know I did. Thanks for responding on the thread related to 2.0.5. I understand the plan much better. I appreciate your reverting HDFS-434. There is still an incompatible change HDFS-4296. It is listed in new features for some reason. Do you still need HDFS-4296 once HDFS-434 is reverted? We did not change LayoutVersion since branch 0.23. Use InodeID as as an identifier of a file in HDFS protocols and APIs Key: HDFS-4489 URL: https://issues.apache.org/jira/browse/HDFS-4489 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Brandon Li Assignee: Brandon Li Fix For: 2.0.5-beta The benefit of using InodeID to uniquely identify a file can be multiple folds. Here are a few of them: 1. uniquely identify a file cross rename, related JIRAs include HDFS-4258, HDFS-4437. 2. modification checks in tools like distcp. Since a file could have been replaced or renamed to, the file name and size combination is no t reliable, but the combination of file id and size is unique. 3. id based protocol support (e.g., NFS) 4. to make the pluggable block placement policy use fileid instead of filename (HDFS-385). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs
[ https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13643424#comment-13643424 ] Suresh Srinivas commented on HDFS-4489: --- bq. There is still an incompatible change HDFS-4296. It is listed in new features for some reason. It is not incompatible and hence not marked as incompatible in jira or CHANGES.txt. It is currently listed as New Feature in CHANGES.txt. I do not think it should be listed under New Features section (though it does not qualify for Improvement, Bug fix any of that). I will move it to bug fix section. bq. Do you still need HDFS-4296 once HDFS-434 is reverted? It is needed because it corresponds to a layout version reserved in branch-1 for concat. It is not related to HDFS-4434. Use InodeID as as an identifier of a file in HDFS protocols and APIs Key: HDFS-4489 URL: https://issues.apache.org/jira/browse/HDFS-4489 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Brandon Li Assignee: Brandon Li Fix For: 2.0.5-beta The benefit of using InodeID to uniquely identify a file can be multiple folds. Here are a few of them: 1. uniquely identify a file cross rename, related JIRAs include HDFS-4258, HDFS-4437. 2. modification checks in tools like distcp. Since a file could have been replaced or renamed to, the file name and size combination is no t reliable, but the combination of file id and size is unique. 3. id based protocol support (e.g., NFS) 4. to make the pluggable block placement policy use fileid instead of filename (HDFS-385). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs
[ https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13642118#comment-13642118 ] Konstantin Shvachko commented on HDFS-4489: --- Posted a request for the bases for porting this to branch 2.0.5 in HDFS-4434. suresh What is the concern? My concern is that you committed incompatible change, which is a new feature and a large change, into the stabilization branch without a vote or a release plan discussed with the community. Being a bad practice in general, I think it is a wrong move now in particular, because people are discussing the stabilization of 2.0.5. This feature totals about 150K of code in patches (counting subtasks only). Not helping stabilization. And you didn't give any reasons for the merge. I would like to ask to revert this merge from branch 2.0.5 and follow the procedures for merging features into new release branches if you decide to proceed. Use InodeID as as an identifier of a file in HDFS protocols and APIs Key: HDFS-4489 URL: https://issues.apache.org/jira/browse/HDFS-4489 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Brandon Li Assignee: Brandon Li Fix For: 2.0.5-beta The benefit of using InodeID to uniquely identify a file can be multiple folds. Here are a few of them: 1. uniquely identify a file cross rename, related JIRAs include HDFS-4258, HDFS-4437. 2. modification checks in tools like distcp. Since a file could have been replaced or renamed to, the file name and size combination is no t reliable, but the combination of file id and size is unique. 3. id based protocol support (e.g., NFS) 4. to make the pluggable block placement policy use fileid instead of filename (HDFS-385). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs
[ https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13642185#comment-13642185 ] Suresh Srinivas commented on HDFS-4489: --- bq. My concern is that you committed incompatible change Konstantin, not sure if you looked at the release notes. This change disallows a file or directory name called .reserved under root. That is the only reason why I marked it as incompatible. This is not related to wire or API incompatibility. That said, one of the goal for 2.0.5 is drive towards a state where incompatible changes are not allowed after it. bq. which is a new feature and a large change, into the stabilization branch without a vote or a release plan discussed with the community. I agree that this is a new features. Committers routinely promote changes that they consider are okay to branch-2. I believe this does not add to the instability. Let me know if you disagree based on a code review/testing. Also merging to branch-2 in a lot of cases is done based on a committer's judgement. Please look various other jiras that are merged in without vote thread into branch-2. I do not consider this as a large feature. However for Snapshot feature, I would have brought up that in a release thread. bq. . And you didn't give any reasons for the merge. I think there is enough motivation for the feature posted in the jira. bq. I would like to ask to revert this merge from branch 2.0.5 and follow the procedures for merging features into new release branches if you decide to proceed. I have spent more than 12 hours merging the chain of jiras required and resolving conflict before getting to 4 changes that introduced file id. Is your concern about HDFS-4434 or all the related changes? Use InodeID as as an identifier of a file in HDFS protocols and APIs Key: HDFS-4489 URL: https://issues.apache.org/jira/browse/HDFS-4489 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Brandon Li Assignee: Brandon Li Fix For: 2.0.5-beta The benefit of using InodeID to uniquely identify a file can be multiple folds. Here are a few of them: 1. uniquely identify a file cross rename, related JIRAs include HDFS-4258, HDFS-4437. 2. modification checks in tools like distcp. Since a file could have been replaced or renamed to, the file name and size combination is no t reliable, but the combination of file id and size is unique. 3. id based protocol support (e.g., NFS) 4. to make the pluggable block placement policy use fileid instead of filename (HDFS-385). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs
[ https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13642212#comment-13642212 ] Nathan Roberts commented on HDFS-4489: -- Sorry this is a really late comment but I'd really like to see some performance numbers before and after. While 6.5% increase in overall heap size is not massive, my main concern is the 25% increase in a very core data structure within the NN (1.07G-1.37G in Todd's measurement of INodeFile). This could cause significant cache pollution and therefore could have a very measurable impact on performance. I don't know for sure that it will, but it seems it would be reasonable to verify. Use InodeID as as an identifier of a file in HDFS protocols and APIs Key: HDFS-4489 URL: https://issues.apache.org/jira/browse/HDFS-4489 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Brandon Li Assignee: Brandon Li Fix For: 2.0.5-beta The benefit of using InodeID to uniquely identify a file can be multiple folds. Here are a few of them: 1. uniquely identify a file cross rename, related JIRAs include HDFS-4258, HDFS-4437. 2. modification checks in tools like distcp. Since a file could have been replaced or renamed to, the file name and size combination is no t reliable, but the combination of file id and size is unique. 3. id based protocol support (e.g., NFS) 4. to make the pluggable block placement policy use fileid instead of filename (HDFS-385). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs
[ https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13642225#comment-13642225 ] Suresh Srinivas commented on HDFS-4489: --- [~nroberts] What performance test would like to be run with and without this change? Use InodeID as as an identifier of a file in HDFS protocols and APIs Key: HDFS-4489 URL: https://issues.apache.org/jira/browse/HDFS-4489 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Brandon Li Assignee: Brandon Li Fix For: 2.0.5-beta The benefit of using InodeID to uniquely identify a file can be multiple folds. Here are a few of them: 1. uniquely identify a file cross rename, related JIRAs include HDFS-4258, HDFS-4437. 2. modification checks in tools like distcp. Since a file could have been replaced or renamed to, the file name and size combination is no t reliable, but the combination of file id and size is unique. 3. id based protocol support (e.g., NFS) 4. to make the pluggable block placement policy use fileid instead of filename (HDFS-385). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs
[ https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13642242#comment-13642242 ] Suresh Srinivas commented on HDFS-4489: --- I have reverted HDFS-4434 from branch-2. Will post the performance numbers and then commit the change to branch-2, based on that discussion. Use InodeID as as an identifier of a file in HDFS protocols and APIs Key: HDFS-4489 URL: https://issues.apache.org/jira/browse/HDFS-4489 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Brandon Li Assignee: Brandon Li Fix For: 2.0.5-beta The benefit of using InodeID to uniquely identify a file can be multiple folds. Here are a few of them: 1. uniquely identify a file cross rename, related JIRAs include HDFS-4258, HDFS-4437. 2. modification checks in tools like distcp. Since a file could have been replaced or renamed to, the file name and size combination is no t reliable, but the combination of file id and size is unique. 3. id based protocol support (e.g., NFS) 4. to make the pluggable block placement policy use fileid instead of filename (HDFS-385). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs
[ https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13642290#comment-13642290 ] Sanjay Radia commented on HDFS-4489: Nathan. A question. Suresh is willing to do the performance benchmark, but I am trying to understand where you are coming from. Yahoo and FB create very large namespaces by simply buying more memory and increasing the size of the heap. Do you worry about cache pollution when you create 50K more files? Given that the NN heap (many GBs) is so much larger than the cache, does the additional inode and inode-map size impact the overall system performance? Suresh has argued that a 24GB heap grows by 625MB. Looking at the growth in memory of this feature as a percentage of the total heap size is a more realistic way of looking at the impact of the growth than the growth of an individual data structure like the inode. IMHO, not having an inode-map and inode number was a serious limitation in the original implementation of NN. I am willing to pay for the extra memory given the value inode-id and inode-map brings (as described by suresh in the beginning of this Jira). Permissions, access time, etc added to the memory cost of the the NN and were accepted because of the value they bring. Use InodeID as as an identifier of a file in HDFS protocols and APIs Key: HDFS-4489 URL: https://issues.apache.org/jira/browse/HDFS-4489 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Brandon Li Assignee: Brandon Li Fix For: 2.0.5-beta The benefit of using InodeID to uniquely identify a file can be multiple folds. Here are a few of them: 1. uniquely identify a file cross rename, related JIRAs include HDFS-4258, HDFS-4437. 2. modification checks in tools like distcp. Since a file could have been replaced or renamed to, the file name and size combination is no t reliable, but the combination of file id and size is unique. 3. id based protocol support (e.g., NFS) 4. to make the pluggable block placement policy use fileid instead of filename (HDFS-385). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs
[ https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13642355#comment-13642355 ] Konstantin Shvachko commented on HDFS-4489: --- Suresh whatever reason for incompatibility it should go through approval process. You also committed the LayoutVersion change HDFS-4296. Now it requires an upgrade. one of the goal for 2.0.5 is drive towards a state where incompatible changes are not allowed after it. That was the goal for Hadoop 0.20. I thought the goal for 2.0.5 is stabilization. Also merging to branch-2 in a lot of cases is done based on a committer's judgement. I think it is wrong. Especially for the stabilization release. I think there is enough motivation for the feature posted in the jira. Not arguing about the value of the feature. But about its necessity for 2.0.5 Is your concern about HDFS-4434 or all the related changes? Most of them. I would have reviewed if I had a proper warning. So again why is it necessary for 2.0.5? Use InodeID as as an identifier of a file in HDFS protocols and APIs Key: HDFS-4489 URL: https://issues.apache.org/jira/browse/HDFS-4489 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Brandon Li Assignee: Brandon Li Fix For: 2.0.5-beta The benefit of using InodeID to uniquely identify a file can be multiple folds. Here are a few of them: 1. uniquely identify a file cross rename, related JIRAs include HDFS-4258, HDFS-4437. 2. modification checks in tools like distcp. Since a file could have been replaced or renamed to, the file name and size combination is no t reliable, but the combination of file id and size is unique. 3. id based protocol support (e.g., NFS) 4. to make the pluggable block placement policy use fileid instead of filename (HDFS-385). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs
[ https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13642455#comment-13642455 ] Suresh Srinivas commented on HDFS-4489: --- {quote} That was the goal for Hadoop 0.20. I thought the goal for 2.0.5 is stabilization. {quote} I am not sure if 0.20 is a typo. If it is not, I have hard time parsing that statement. See the previous discussion about 2.0.4-beta (now called 2.0.5) in this thread: http://hadoop.markmail.org/thread/v44nqp466p76jpkj bq. I think it is wrong. Especially for the stabilization release. I disagree. I want to get some of the features I have been working on into this release. I think the goal of this release is to get API and wire compatibility stable. bq. Most of them. I would have reviewed if I had a proper warning. I am not sure what kind of warning you are talking about. HDFS-4434 has been in development for a long time with more than 32 iterations of the patch. bq. So again why is it necessary for 2.0.5? Snapshot and NFS feature depends on this. I would like see it become available in 2.0.5. Use InodeID as as an identifier of a file in HDFS protocols and APIs Key: HDFS-4489 URL: https://issues.apache.org/jira/browse/HDFS-4489 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Brandon Li Assignee: Brandon Li Fix For: 2.0.5-beta The benefit of using InodeID to uniquely identify a file can be multiple folds. Here are a few of them: 1. uniquely identify a file cross rename, related JIRAs include HDFS-4258, HDFS-4437. 2. modification checks in tools like distcp. Since a file could have been replaced or renamed to, the file name and size combination is no t reliable, but the combination of file id and size is unique. 3. id based protocol support (e.g., NFS) 4. to make the pluggable block placement policy use fileid instead of filename (HDFS-385). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs
[ https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13640438#comment-13640438 ] Suresh Srinivas commented on HDFS-4489: --- I am planning to push the subtasks of this jira to release 2.0. Use InodeID as as an identifier of a file in HDFS protocols and APIs Key: HDFS-4489 URL: https://issues.apache.org/jira/browse/HDFS-4489 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Brandon Li Assignee: Brandon Li The benefit of using InodeID to uniquely identify a file can be multiple folds. Here are a few of them: 1. uniquely identify a file cross rename, related JIRAs include HDFS-4258, HDFS-4437. 2. modification checks in tools like distcp. Since a file could have been replaced or renamed to, the file name and size combination is no t reliable, but the combination of file id and size is unique. 3. id based protocol support (e.g., NFS) 4. to make the pluggable block placement policy use fileid instead of filename (HDFS-385). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs
[ https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13633360#comment-13633360 ] Suresh Srinivas commented on HDFS-4489: --- For people who are following this jira, HDFS-4434 is now ready for review and commit. Please provide any feedback you have soon. otherwise the comments that come late will have to be incorporated in a subsequent jira. Use InodeID as as an identifier of a file in HDFS protocols and APIs Key: HDFS-4489 URL: https://issues.apache.org/jira/browse/HDFS-4489 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Brandon Li Assignee: Brandon Li The benefit of using InodeID to uniquely identify a file can be multiple folds. Here are a few of them: 1. uniquely identify a file cross rename, related JIRAs include HDFS-4258, HDFS-4437. 2. modification checks in tools like distcp. Since a file could have been replaced or renamed to, the file name and size combination is no t reliable, but the combination of file id and size is unique. 3. id based protocol support (e.g., NFS) 4. to make the pluggable block placement policy use fileid instead of filename (HDFS-385). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs
[ https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13628931#comment-13628931 ] Daryn Sharp commented on HDFS-4489: --- bq. {quote}Perhaps ASN.1 encoding the long for the inode id will significantly decrease the memory consumption?{quote} bq. Can you add more details on how this would decrease memory consumption? If the long is encoded as a variable length byte array, it should take a long time to exceed 4-5 bytes. With minimal effort complexity, the memory increase would nominally be cut in half for many deployments. Just a suggestion. Use InodeID as as an identifier of a file in HDFS protocols and APIs Key: HDFS-4489 URL: https://issues.apache.org/jira/browse/HDFS-4489 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Brandon Li Assignee: Brandon Li The benefit of using InodeID to uniquely identify a file can be multiple folds. Here are a few of them: 1. uniquely identify a file cross rename, related JIRAs include HDFS-4258, HDFS-4437. 2. modification checks in tools like distcp. Since a file could have been replaced or renamed to, the file name and size combination is no t reliable, but the combination of file id and size is unique. 3. id based protocol support (e.g., NFS) 4. to make the pluggable block placement policy use fileid instead of filename (HDFS-385). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs
[ https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13629197#comment-13629197 ] Arpit Agarwal commented on HDFS-4489: - {quote} If the long is encoded as a variable length byte array, it should take a long time to exceed 4-5 bytes. With minimal effort complexity, the memory increase would nominally be cut in half for many deployments. {quote} This would save space when serializing the fsImage. I am not sure if we can reduce in-memory usage below the size of a primitive long since the byte array is an object. Use InodeID as as an identifier of a file in HDFS protocols and APIs Key: HDFS-4489 URL: https://issues.apache.org/jira/browse/HDFS-4489 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Brandon Li Assignee: Brandon Li The benefit of using InodeID to uniquely identify a file can be multiple folds. Here are a few of them: 1. uniquely identify a file cross rename, related JIRAs include HDFS-4258, HDFS-4437. 2. modification checks in tools like distcp. Since a file could have been replaced or renamed to, the file name and size combination is no t reliable, but the combination of file id and size is unique. 3. id based protocol support (e.g., NFS) 4. to make the pluggable block placement policy use fileid instead of filename (HDFS-385). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs
[ https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13627920#comment-13627920 ] Daryn Sharp commented on HDFS-4489: --- Perhaps ASN.1 encoding the long for the inode id will significantly decrease the memory consumption? Use InodeID as as an identifier of a file in HDFS protocols and APIs Key: HDFS-4489 URL: https://issues.apache.org/jira/browse/HDFS-4489 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Brandon Li Assignee: Brandon Li The benefit of using InodeID to uniquely identify a file can be multiple folds. Here are a few of them: 1. uniquely identify a file cross rename, related JIRAs include HDFS-4258, HDFS-4437. 2. modification checks in tools like distcp. Since a file could have been replaced or renamed to, the file name and size combination is no t reliable, but the combination of file id and size is unique. 3. id based protocol support (e.g., NFS) 4. to make the pluggable block placement policy use fileid instead of filename (HDFS-385). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs
[ https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13628050#comment-13628050 ] Suresh Srinivas commented on HDFS-4489: --- bq. Perhaps ASN.1 encoding the long for the inode id will significantly decrease the memory consumption? Can you add more details on how this would decrease memory consumption? BTW inodeID was added as a part of HDFS-4334. See the discussion about how reduce the impact of adding inode ID - https://issues.apache.org/jira/browse/HDFS-4258?focusedCommentId=13508432page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13508432. But I am not sure if that optimization is necessary at the expense of code. Thoughts? Use InodeID as as an identifier of a file in HDFS protocols and APIs Key: HDFS-4489 URL: https://issues.apache.org/jira/browse/HDFS-4489 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Brandon Li Assignee: Brandon Li The benefit of using InodeID to uniquely identify a file can be multiple folds. Here are a few of them: 1. uniquely identify a file cross rename, related JIRAs include HDFS-4258, HDFS-4437. 2. modification checks in tools like distcp. Since a file could have been replaced or renamed to, the file name and size combination is no t reliable, but the combination of file id and size is unique. 3. id based protocol support (e.g., NFS) 4. to make the pluggable block placement policy use fileid instead of filename (HDFS-385). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs
[ https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13628057#comment-13628057 ] Kihwal Lee commented on HDFS-4489: -- bq. With this change, it is expected that NN is allocated more memory, say 5%. If this is done I am not sure why users should be told namespace limit is X% worse? In many use cases, allocating more heap may not be a problem since machines typically have more memory available. But if you approach from the view point of owners of existing hardware that was spec'ed to hold certain size of namespace, it can be viewed as a decrease of capacity. I am not saying it is a showstopper. I just felt it should be given more thought. I will review the implementation and try to understand your concerns about more memory efficient design. Use InodeID as as an identifier of a file in HDFS protocols and APIs Key: HDFS-4489 URL: https://issues.apache.org/jira/browse/HDFS-4489 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Brandon Li Assignee: Brandon Li The benefit of using InodeID to uniquely identify a file can be multiple folds. Here are a few of them: 1. uniquely identify a file cross rename, related JIRAs include HDFS-4258, HDFS-4437. 2. modification checks in tools like distcp. Since a file could have been replaced or renamed to, the file name and size combination is no t reliable, but the combination of file id and size is unique. 3. id based protocol support (e.g., NFS) 4. to make the pluggable block placement policy use fileid instead of filename (HDFS-385). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs
[ https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13628197#comment-13628197 ] Suresh Srinivas commented on HDFS-4489: --- bq. But if you approach from the view point of owners of existing hardware that was spec'ed to hold certain size of namespace, it can be viewed as a decrease of capacity. Again I do not believe anyone runs with NN very tightly configured given the nature garbage collection. That said, to make further progress, the following optimizations can be done: # Initialize the map only when this feature is enabled. Should take away roughly 1/3 of extra memory. # Reuse existing bits in INodeId - https://issues.apache.org/jira/secure/EditComment!default.jspa?id=12618468commentId=13508432. Should take away roughly 1/3 of extra memory. # Use first block ID of the file (after ensuring even empty file has an associated block) as the InodeID. This is very ugly and mixing two abstractions that should not be mixed. I am reluctant to make this optimization. My vote is to keep the code simple, abstractions clean. If folks think the above optimizations is worth pursuing, I will update the patch. Use InodeID as as an identifier of a file in HDFS protocols and APIs Key: HDFS-4489 URL: https://issues.apache.org/jira/browse/HDFS-4489 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Brandon Li Assignee: Brandon Li The benefit of using InodeID to uniquely identify a file can be multiple folds. Here are a few of them: 1. uniquely identify a file cross rename, related JIRAs include HDFS-4258, HDFS-4437. 2. modification checks in tools like distcp. Since a file could have been replaced or renamed to, the file name and size combination is no t reliable, but the combination of file id and size is unique. 3. id based protocol support (e.g., NFS) 4. to make the pluggable block placement policy use fileid instead of filename (HDFS-385). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs
[ https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13628217#comment-13628217 ] Brandon Li commented on HDFS-4489: -- {quote}I am not saying it is a showstopper. I just felt it should be given more thought. {quote} In many cases, a trade-off is involved with the introduction of a new feature or enhancement. This JIRA was forked from HDFS-4258 and the discussion/experiment has been going on for more than 4 months. As shown in the theory analysis and experiment results, the memory overhead of this change is not significant. It doesn't seems to be worthwhile for now to complicate NameNode code to do the extra optimizations. Use InodeID as as an identifier of a file in HDFS protocols and APIs Key: HDFS-4489 URL: https://issues.apache.org/jira/browse/HDFS-4489 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Brandon Li Assignee: Brandon Li The benefit of using InodeID to uniquely identify a file can be multiple folds. Here are a few of them: 1. uniquely identify a file cross rename, related JIRAs include HDFS-4258, HDFS-4437. 2. modification checks in tools like distcp. Since a file could have been replaced or renamed to, the file name and size combination is no t reliable, but the combination of file id and size is unique. 3. id based protocol support (e.g., NFS) 4. to make the pluggable block placement policy use fileid instead of filename (HDFS-385). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs
[ https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13626799#comment-13626799 ] Daryn Sharp commented on HDFS-4489: --- Maybe something simple like GridMix to get a rough feel for the overhead of the extra resolution. I don't expect it to be much, but it'd be nice to know. Use InodeID as as an identifier of a file in HDFS protocols and APIs Key: HDFS-4489 URL: https://issues.apache.org/jira/browse/HDFS-4489 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Brandon Li Assignee: Brandon Li The benefit of using InodeID to uniquely identify a file can be multiple folds. Here are a few of them: 1. uniquely identify a file cross rename, related JIRAs include HDFS-4258, HDFS-4437. 2. modification checks in tools like distcp. Since a file could have been replaced or renamed to, the file name and size combination is no t reliable, but the combination of file id and size is unique. 3. id based protocol support (e.g., NFS) 4. to make the pluggable block placement policy use fileid instead of filename (HDFS-385). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs
[ https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13626868#comment-13626868 ] Kihwal Lee commented on HDFS-4489: -- bq. Please look at the overall increase in memory usage instead of increase over used memory. Your point would be valid only if the overhead was entirely a fixed amount (e.g. GSet). Since the extra memory consumption increases as the size of namespace grows, factoring the arbitrary max heap size into this can be misleading. But I agree that the 9% figure does not have an absolute meaning either. If the inode-to-block ratio is different, the number will be different. For the clusters I have seen, it will be a lower number. The GSet used for InodeID to INode map is also semi-fixed. Is it allocated similarly to BlocksMap? In any case, I would not call this insignificant. We have a namenode which will not work well if we upgrade to a release with this feature since it will need extra 4-6GB for the steady-state operation. Even if it could absorb the extra memory requirement, we would have to tell users that the namespace limit is X% worse. Simply saying the overhead is insignificant won't convince users. We should explain why the benefit from having this feature justifies the overhead. I don't think on/off switch is necessary. Use InodeID as as an identifier of a file in HDFS protocols and APIs Key: HDFS-4489 URL: https://issues.apache.org/jira/browse/HDFS-4489 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Brandon Li Assignee: Brandon Li The benefit of using InodeID to uniquely identify a file can be multiple folds. Here are a few of them: 1. uniquely identify a file cross rename, related JIRAs include HDFS-4258, HDFS-4437. 2. modification checks in tools like distcp. Since a file could have been replaced or renamed to, the file name and size combination is no t reliable, but the combination of file id and size is unique. 3. id based protocol support (e.g., NFS) 4. to make the pluggable block placement policy use fileid instead of filename (HDFS-385). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs
[ https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13626881#comment-13626881 ] Suresh Srinivas commented on HDFS-4489: --- bq. The GSet used for InodeID to INode map is also semi-fixed. Is it allocated similarly to BlocksMap? Yes. Please see the patch in HDFS-4434. About 1% of heap is used for the GSet. bq. Simply saying the overhead is insignificant won't convince users. We should explain why the benefit from having this feature justifies the overhead. I don't think on/off switch is necessary. I think the assertion here is not overhead is insignificant. Depending on details of how the namespace of a system is laid out, I would think this would be anywhere from 2 to 5%. As far the benefits, in the main description I laid this out: --- This helps in several use cases: # HDFS can evolve to support ID based protocols such as NFS. We plan to add an experimental NFS V3 gateway to HDFS using this mechanism. Will post a github link soon. # InodeID can be used by the tools to track a single instance of a file, for cacheing data or tracking and checking for modification based on INodeID, in tools like distcp. # Path cannot identify a unique instance of a file. This causes issues as described in HDFS-4258 and HDFS-4437. It has also been a requirement of many other jiras such as HDFS-385. # Using InodeID as an identifier instead of path can be more efficient than path bases accesses. --- bq. We have a namenode which will not work well if we upgrade to a release with this feature since it will need extra 4-6GB for the steady-state operation. Even if it could absorb the extra memory requirement, we would have to tell users that the namespace limit is X% worse. Is this because namenode does not have RAM? With this change, it is expected that NN is allocated more memory, say 5%. If this is done I am not sure why users should be told namespace limit is X% worse? My rationale, repeating what I said earlier is, machines are becoming available with more RAM. Adding 5% JVM heap should not be a problem. In fact most of the namenodes are configured with enough head room already and might not even need a change. But if this is a big concern, I am okay making additional change to bring down the memory consumption close to zero. Use InodeID as as an identifier of a file in HDFS protocols and APIs Key: HDFS-4489 URL: https://issues.apache.org/jira/browse/HDFS-4489 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Brandon Li Assignee: Brandon Li The benefit of using InodeID to uniquely identify a file can be multiple folds. Here are a few of them: 1. uniquely identify a file cross rename, related JIRAs include HDFS-4258, HDFS-4437. 2. modification checks in tools like distcp. Since a file could have been replaced or renamed to, the file name and size combination is no t reliable, but the combination of file id and size is unique. 3. id based protocol support (e.g., NFS) 4. to make the pluggable block placement policy use fileid instead of filename (HDFS-385). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs
[ https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13625918#comment-13625918 ] Daryn Sharp commented on HDFS-4489: --- I've only skimmed this jira, but a 9% increase is fairly substantial for large namespaces. Are there any performance metrics available? Use InodeID as as an identifier of a file in HDFS protocols and APIs Key: HDFS-4489 URL: https://issues.apache.org/jira/browse/HDFS-4489 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Brandon Li Assignee: Brandon Li The benefit of using InodeID to uniquely identify a file can be multiple folds. Here are a few of them: 1. uniquely identify a file cross rename, related JIRAs include HDFS-4258, HDFS-4437. 2. modification checks in tools like distcp. Since a file could have been replaced or renamed to, the file name and size combination is no t reliable, but the combination of file id and size is unique. 3. id based protocol support (e.g., NFS) 4. to make the pluggable block placement policy use fileid instead of filename (HDFS-385). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs
[ https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13625969#comment-13625969 ] Suresh Srinivas commented on HDFS-4489: --- bq. 9% increase is fairly substantial for large namespaces. Please look at the overall increase in memory usage instead of increase over used memory. As I said that is close 2.6%. bq. Are there any performance metrics available? I do not see much concern here. In fact I removed the flag to turn this feature on or off. If you think based on the code this is a concern, I could add the flag back. What metrics would you like to see? Use InodeID as as an identifier of a file in HDFS protocols and APIs Key: HDFS-4489 URL: https://issues.apache.org/jira/browse/HDFS-4489 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Brandon Li Assignee: Brandon Li The benefit of using InodeID to uniquely identify a file can be multiple folds. Here are a few of them: 1. uniquely identify a file cross rename, related JIRAs include HDFS-4258, HDFS-4437. 2. modification checks in tools like distcp. Since a file could have been replaced or renamed to, the file name and size combination is no t reliable, but the combination of file id and size is unique. 3. id based protocol support (e.g., NFS) 4. to make the pluggable block placement policy use fileid instead of filename (HDFS-385). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs
[ https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13617328#comment-13617328 ] Suresh Srinivas commented on HDFS-4489: --- Any further comments? I plan to wrap up HDFS-4334 soon. If there are no further concerns, I do not plan on optimizing memory further at the expense of code complexity. Use InodeID as as an identifier of a file in HDFS protocols and APIs Key: HDFS-4489 URL: https://issues.apache.org/jira/browse/HDFS-4489 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Brandon Li Assignee: Brandon Li The benefit of using InodeID to uniquely identify a file can be multiple folds. Here are a few of them: 1. uniquely identify a file cross rename, related JIRAs include HDFS-4258, HDFS-4437. 2. modification checks in tools like distcp. Since a file could have been replaced or renamed to, the file name and size combination is no t reliable, but the combination of file id and size is unique. 3. id based protocol support (e.g., NFS) 4. to make the pluggable block placement policy use fileid instead of filename (HDFS-385). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs
[ https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13616634#comment-13616634 ] Todd Lipcon commented on HDFS-4489: --- Here's the results from the latest patch: h2. Setup - Java 6u31 convigured with a 24gb heap (-Xms24g -Xmx24g) - fsimage is 4.1GB on disk, snapshot from a mid size production cluster which runs both hbase and some MR workloads. - 31249022 files and directories, 26525575 blocks = 57774597 total filesystem objects. In each test, I started the NameNode, waited until it had loaded the image and opened its IPC port, and then used jmap -histo:live, which issues a full GC and reports heap usage statistics. h2. 2.0.3-beta release Total heap: 7069MB Top consumers {code} num #instances #bytes class name -- 1: 38421509 2049194112 [Ljava.lang.Object; 2: 26525179 1485410024 org.apache.hadoop.hdfs.server.blockmanagement.BlockInfo 3: 19134601 1071537656 org.apache.hadoop.hdfs.server.namenode.INodeFile 4: 16228949 753517120 [B 5: 12113580 581451840 org.apache.hadoop.hdfs.server.namenode.INodeDirectory 6: 19135442 484175352 [Lorg.apache.hadoop.hdfs.server.blockmanagement.BlockInfo; 7: 1621399 403948560 [I 8: 11895039 285480936 java.util.ArrayList 9: 1 268435472 [Lorg.apache.hadoop.hdfs.util.LightWeightGSet$LinkedElement; {code} h2. Patched trunk with the map turned off Total heap: 7528MB (6.5% increase from 2.0.3) Top consumers {code} num #instances #bytes class name -- 1: 38421427 2049187584 [Ljava.lang.Object; 2: 26525179 1485410024 org.apache.hadoop.hdfs.server.blockmanagement.BlockInfo 3: 19134601 1377691272 org.apache.hadoop.hdfs.server.namenode.INodeFile 4: 12113580 775269120 org.apache.hadoop.hdfs.server.namenode.INodeDirectory 5: 16228690 753509864 [B 6: 19135442 484175352 [Lorg.apache.hadoop.hdfs.server.blockmanagement.BlockInfo; 7: 1654298 384726200 [I 8: 11895040 285480960 java.util.ArrayList 9: 1 268435472 [Lorg.apache.hadoop.hdfs.util.LightWeightGSet$LinkedElement; {code} h2. Patched trunk with the map turned on Total heap: 7696MB (8.9% increase from 2.0) Top consumers {code} num #instances #bytes class name -- 1: 38421429 2049187632 [Ljava.lang.Object; 2: 26525179 1485410024 org.apache.hadoop.hdfs.server.blockmanagement.BlockInfo 3: 19134601 1377691272 org.apache.hadoop.hdfs.server.namenode.INodeFile 4: 12113580 775269120 org.apache.hadoop.hdfs.server.namenode.INodeDirectory 5: 16228746 753515976 [B 6: 19135442 484175352 [Lorg.apache.hadoop.hdfs.server.blockmanagement.BlockInfo; 7: 1499494 426158720 [I 8: 2 402653216 [Lorg.apache.hadoop.hdfs.util.LightWeightGSet$LinkedElement; 9: 11895040 285480960 java.util.ArrayList {code} I don't think this increased memory is necessarily unacceptable, I just wanted to see true measurement of the overhead instead of hypotheses. It looks like the increased memory cost is about twice what was estimated above. Use InodeID as as an identifier of a file in HDFS protocols and APIs Key: HDFS-4489 URL: https://issues.apache.org/jira/browse/HDFS-4489 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Brandon Li Assignee: Brandon Li The benefit of using InodeID to uniquely identify a file can be multiple folds. Here are a few of them: 1. uniquely identify a file cross rename, related JIRAs include HDFS-4258, HDFS-4437. 2. modification checks in tools like distcp. Since a file could have been replaced or renamed to, the file name and size combination is no t reliable, but the combination of file id and size is unique. 3. id based protocol support (e.g., NFS) 4. to make the pluggable block placement policy use fileid instead of filename (HDFS-385). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs
[ https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13616645#comment-13616645 ] Suresh Srinivas commented on HDFS-4489: --- [~tlipcon] Thanks for running the tests. I personally am not concerned about this increased memory. If there are others with concerns, I can try reducing memory consumption further at the expense more complex code. Thoughts? Use InodeID as as an identifier of a file in HDFS protocols and APIs Key: HDFS-4489 URL: https://issues.apache.org/jira/browse/HDFS-4489 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Brandon Li Assignee: Brandon Li The benefit of using InodeID to uniquely identify a file can be multiple folds. Here are a few of them: 1. uniquely identify a file cross rename, related JIRAs include HDFS-4258, HDFS-4437. 2. modification checks in tools like distcp. Since a file could have been replaced or renamed to, the file name and size combination is no t reliable, but the combination of file id and size is unique. 3. id based protocol support (e.g., NFS) 4. to make the pluggable block placement policy use fileid instead of filename (HDFS-385). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs
[ https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13616652#comment-13616652 ] Suresh Srinivas commented on HDFS-4489: --- BTW my calculations of increased memory is against the total java heap allocated to the process than memory used in old generation alone. That is a better way to quantify the impact on users, right? Use InodeID as as an identifier of a file in HDFS protocols and APIs Key: HDFS-4489 URL: https://issues.apache.org/jira/browse/HDFS-4489 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Brandon Li Assignee: Brandon Li The benefit of using InodeID to uniquely identify a file can be multiple folds. Here are a few of them: 1. uniquely identify a file cross rename, related JIRAs include HDFS-4258, HDFS-4437. 2. modification checks in tools like distcp. Since a file could have been replaced or renamed to, the file name and size combination is no t reliable, but the combination of file id and size is unique. 3. id based protocol support (e.g., NFS) 4. to make the pluggable block placement policy use fileid instead of filename (HDFS-385). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs
[ https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13616658#comment-13616658 ] Suresh Srinivas commented on HDFS-4489: --- bq. BTW my calculations of increased memory is against the total java heap allocated to the process than memory used in old generation alone. That is a better way to quantify the impact on users, right? Sorry my previous comments may not be clear to every one. Increases of 625MB from 7069MB to 7696MB is 8.9%, the way I was quantifying was percentage of entire java heap memory. That is 625MB out of 24G, that is 2.6%. Use InodeID as as an identifier of a file in HDFS protocols and APIs Key: HDFS-4489 URL: https://issues.apache.org/jira/browse/HDFS-4489 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Brandon Li Assignee: Brandon Li The benefit of using InodeID to uniquely identify a file can be multiple folds. Here are a few of them: 1. uniquely identify a file cross rename, related JIRAs include HDFS-4258, HDFS-4437. 2. modification checks in tools like distcp. Since a file could have been replaced or renamed to, the file name and size combination is no t reliable, but the combination of file id and size is unique. 3. id based protocol support (e.g., NFS) 4. to make the pluggable block placement policy use fileid instead of filename (HDFS-385). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs
[ https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13615575#comment-13615575 ] Todd Lipcon commented on HDFS-4489: --- bq. byte[] name - I assume typically ~56 bytes for this. That is (16 bytes object overhead, 8 byte length + bytes that make up file name, say 32) According to your comment here: https://issues.apache.org/jira/browse/HDFS-1110?focusedCommentId=12861548page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12861548 a typical image with ~50M files will only need ~5M unique name byte[] objects, so I think it's unfair to count the above against the inode. I think you're also adding an extra 8 bytes on the arrays -- the array length as I understand it is a field within the 16byte object header (occupying the second half of the klassId field). Regardless, this seems like something that's very easy to test rather than try to solve analytically. Do you have results for the additional memory overhead of this map on a large production image? If it's truly 3-5%, seems reasonably, but I'm afraid it may look closer to 10+% in practice. Use InodeID as as an identifier of a file in HDFS protocols and APIs Key: HDFS-4489 URL: https://issues.apache.org/jira/browse/HDFS-4489 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Brandon Li Assignee: Brandon Li The benefit of using InodeID to uniquely identify a file can be multiple folds. Here are a few of them: 1. uniquely identify a file cross rename, related JIRAs include HDFS-4258, HDFS-4437. 2. modification checks in tools like distcp. Since a file could have been replaced or renamed to, the file name and size combination is no t reliable, but the combination of file id and size is unique. 3. id based protocol support (e.g., NFS) 4. to make the pluggable block placement policy use fileid instead of filename (HDFS-385). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs
[ https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13615665#comment-13615665 ] Suresh Srinivas commented on HDFS-4489: --- bq. I think you're also adding an extra 8 bytes on the arrays – the array length as I understand it is a field within the 16byte object header (occupying the second half of the klassId field). If you have an authoritative source, please send me that. I cannot understand how 16 byte object header have spare of say possible 8 bytes to track array length. Some of of my previous instrumentation had led me to conclude the the array length is 4 bytes for 32bit JVM and 8 bytes for 64 bit JVM. See discussion here - http://www.javamex.com/tutorials/memory/object_memory_usage.shtml. bq. a typical image with ~50M files will only need ~5M unique name byte[] objects, so I think it's unfair to count the above against the inode. That is a fair point. But my own inodes occupies 1/3rd of java heap is also an approximation and in practice I would think it inodes occupy smaller than that. I would like to run an experiment on a large production image. But I do not have ready access to it and will have to spend time getting to it. Do you have any? bq. but I'm afraid it may look closer to 10+% in practice. I do not think it will be close to 10%, but lets say it is. I do not see much issues with it. When we did some of the optimizations earlier, we were not sure how JVM would do if goes closes to 64G and hence wanted to keep the heap size down. But since then many large installations have successfully, without any issues gone beyond that size. Smaller installations should be able to spare, say, 10% extra heap. But if that is not acceptable, here are the alternatives I see: # Add configuration options to turn this feature off. Not instantiating GSet will reduce the overhead by 1/3rd. This is simple to do. # Make more optimizations at the expense of code complexity. I would like to avoid this. But if it is deemed very important, with some optimizations, we can get it close to 0%. Use InodeID as as an identifier of a file in HDFS protocols and APIs Key: HDFS-4489 URL: https://issues.apache.org/jira/browse/HDFS-4489 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Brandon Li Assignee: Brandon Li The benefit of using InodeID to uniquely identify a file can be multiple folds. Here are a few of them: 1. uniquely identify a file cross rename, related JIRAs include HDFS-4258, HDFS-4437. 2. modification checks in tools like distcp. Since a file could have been replaced or renamed to, the file name and size combination is no t reliable, but the combination of file id and size is unique. 3. id based protocol support (e.g., NFS) 4. to make the pluggable block placement policy use fileid instead of filename (HDFS-385). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs
[ https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13615677#comment-13615677 ] Todd Lipcon commented on HDFS-4489: --- bq. If you have an authoritative source, please send me that Sure, from the JDK7 source code hotspot/src/share/vm/oops/arrayOop.hpp: {code} // The layout of array Oops is: // // markOop // klassOop // 32 bits if compressed but declared 64 in LP64. // length// shares klass memory or allocated after declared fields. {code} Important to note that the length of arrays is 32-bit, since array.length is an int rather than a long. So given a 64-bit field for klassId, it can use 32-bits for the actual class and 32 bits for the array length. bq. I would like to run an experiment on a large production image. But I do not have ready access to it and will have to spend time getting to it. Do you have any? Yes, I can run the experiment on a large image. Is HDFS-4434's patch ready to apply so I can test it? Use InodeID as as an identifier of a file in HDFS protocols and APIs Key: HDFS-4489 URL: https://issues.apache.org/jira/browse/HDFS-4489 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Brandon Li Assignee: Brandon Li The benefit of using InodeID to uniquely identify a file can be multiple folds. Here are a few of them: 1. uniquely identify a file cross rename, related JIRAs include HDFS-4258, HDFS-4437. 2. modification checks in tools like distcp. Since a file could have been replaced or renamed to, the file name and size combination is no t reliable, but the combination of file id and size is unique. 3. id based protocol support (e.g., NFS) 4. to make the pluggable block placement policy use fileid instead of filename (HDFS-385). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs
[ https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13615709#comment-13615709 ] Suresh Srinivas commented on HDFS-4489: --- bq. or allocated after declared fields. Not sure what this means though. HDFS-4434 patch is ready. Thanks in advance for running the tests. Use InodeID as as an identifier of a file in HDFS protocols and APIs Key: HDFS-4489 URL: https://issues.apache.org/jira/browse/HDFS-4489 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Brandon Li Assignee: Brandon Li The benefit of using InodeID to uniquely identify a file can be multiple folds. Here are a few of them: 1. uniquely identify a file cross rename, related JIRAs include HDFS-4258, HDFS-4437. 2. modification checks in tools like distcp. Since a file could have been replaced or renamed to, the file name and size combination is no t reliable, but the combination of file id and size is unique. 3. id based protocol support (e.g., NFS) 4. to make the pluggable block placement policy use fileid instead of filename (HDFS-385). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs
[ https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13615749#comment-13615749 ] Todd Lipcon commented on HDFS-4489: --- // The _length field is not declared in C++. It is allocated after the // declared nonstatic fields in arrayOopDesc if not compressed, otherwise // it occupies the second half of the _klass field in oopDesc. static int length_offset_in_bytes() { return UseCompressedOops ? klass_gap_offset_in_bytes() : sizeof(arrayOopDesc); } Basically if CompressedOops are on, then klassids are only 32-bits, but there's already a 64-bit field for it, so it just uses the latter 4 bytes for the array length. Otherwise it's an extra 4 bytes that comes after the standard oop header (oopDesc). So, without compressed oops, arrays take 20 bytes base. With them (on by default on heaps 32GB since 6u18 I believe), the array header is the same size as normal objects (16 bytes). Will take a look at loading a big image with that patch now. Use InodeID as as an identifier of a file in HDFS protocols and APIs Key: HDFS-4489 URL: https://issues.apache.org/jira/browse/HDFS-4489 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Brandon Li Assignee: Brandon Li The benefit of using InodeID to uniquely identify a file can be multiple folds. Here are a few of them: 1. uniquely identify a file cross rename, related JIRAs include HDFS-4258, HDFS-4437. 2. modification checks in tools like distcp. Since a file could have been replaced or renamed to, the file name and size combination is no t reliable, but the combination of file id and size is unique. 3. id based protocol support (e.g., NFS) 4. to make the pluggable block placement policy use fileid instead of filename (HDFS-385). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs
[ https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13615755#comment-13615755 ] Suresh Srinivas commented on HDFS-4489: --- Ran some quick tests for object sizes, using https://github.com/dweiss/java-sizeof (pretty neat stuff!) {code} public static void main(String[] args) { System.out.println(RamUsageEstimator.sizeOf(new Object())); System.out.println(RamUsageEstimator.sizeOf(new Object[0])); System.out.println(RamUsageEstimator.sizeOf(new Object[100])); } {code} With compressed oops on I get: 16 16 416 After turning it off: 16 24 824 Use InodeID as as an identifier of a file in HDFS protocols and APIs Key: HDFS-4489 URL: https://issues.apache.org/jira/browse/HDFS-4489 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Brandon Li Assignee: Brandon Li The benefit of using InodeID to uniquely identify a file can be multiple folds. Here are a few of them: 1. uniquely identify a file cross rename, related JIRAs include HDFS-4258, HDFS-4437. 2. modification checks in tools like distcp. Since a file could have been replaced or renamed to, the file name and size combination is no t reliable, but the combination of file id and size is unique. 3. id based protocol support (e.g., NFS) 4. to make the pluggable block placement policy use fileid instead of filename (HDFS-385). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs
[ https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13615760#comment-13615760 ] Todd Lipcon commented on HDFS-4489: --- Neat. I'm setting up those tests now... taking a while to clone/build hadoop onto the right machine that has enough RAM. Use InodeID as as an identifier of a file in HDFS protocols and APIs Key: HDFS-4489 URL: https://issues.apache.org/jira/browse/HDFS-4489 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Brandon Li Assignee: Brandon Li The benefit of using InodeID to uniquely identify a file can be multiple folds. Here are a few of them: 1. uniquely identify a file cross rename, related JIRAs include HDFS-4258, HDFS-4437. 2. modification checks in tools like distcp. Since a file could have been replaced or renamed to, the file name and size combination is no t reliable, but the combination of file id and size is unique. 3. id based protocol support (e.g., NFS) 4. to make the pluggable block placement policy use fileid instead of filename (HDFS-385). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs
[ https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13614692#comment-13614692 ] Todd Lipcon commented on HDFS-4489: --- bq. Inode size is ~180 bytes and this proposal adds 16-24 bytes per Inode. How is this calculated? I see the following 5 fields: {code} private byte[] name = null; private long permission = 0L; protected INodeDirectory parent = null; protected long modificationTime = 0L; protected long accessTime = 0L; {code} for a total of 40 bytes on a 64-bit JVM. So, adding 16-24 bytes is a pretty substantial new memory use. I agree with ATM that this should go on a branch since it's fairly invasive. Once the branch is working, we can evaluate the benefit of the new feature vs the measured cost (both memory and additional CPU to manage this new structure) Use InodeID as as an identifier of a file in HDFS protocols and APIs Key: HDFS-4489 URL: https://issues.apache.org/jira/browse/HDFS-4489 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Brandon Li Assignee: Brandon Li The benefit of using InodeID to uniquely identify a file can be multiple folds. Here are a few of them: 1. uniquely identify a file cross rename, related JIRAs include HDFS-4258, HDFS-4437. 2. modification checks in tools like distcp. Since a file could have been replaced or renamed to, the file name and size combination is no t reliable, but the combination of file id and size is unique. 3. id based protocol support (e.g., NFS) 4. to make the pluggable block placement policy use fileid instead of filename (HDFS-385). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs
[ https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13614694#comment-13614694 ] Todd Lipcon commented on HDFS-4489: --- (I guess I should add the subclass fields, in which case INodeFile has another two 8-byte fields, plus the associated array object for BlockInfo, etc)... but still seems to come in a lot less than 180 bytes. Use InodeID as as an identifier of a file in HDFS protocols and APIs Key: HDFS-4489 URL: https://issues.apache.org/jira/browse/HDFS-4489 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Brandon Li Assignee: Brandon Li The benefit of using InodeID to uniquely identify a file can be multiple folds. Here are a few of them: 1. uniquely identify a file cross rename, related JIRAs include HDFS-4258, HDFS-4437. 2. modification checks in tools like distcp. Since a file could have been replaced or renamed to, the file name and size combination is no t reliable, but the combination of file id and size is unique. 3. id based protocol support (e.g., NFS) 4. to make the pluggable block placement policy use fileid instead of filename (HDFS-385). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs
[ https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13614721#comment-13614721 ] Suresh Srinivas commented on HDFS-4489: --- bq. for a total of 40 bytes on a 64-bit JVM. So, adding 16-24 bytes is a pretty substantial new memory use. Here are the things that goes into ~180 bytes: INode is an object. It comes with the cost of 16 bytes object header overhead. Members include: # byte[] name - I assume typically ~56 bytes for this. That is (16 bytes object overhead, 8 byte length + bytes that make up file name, say 32) # reference to byte[] name - 8 bytes # long permission at the cost of 8 bytes. # parent reference at 8 bytes cost # modification time at 8 bytes cost # accessTime at 8 bytes cost That is roughly ~112 bytes. Typically most of the INodes are INode files (I will leave the other type of inodes as an exercise). # It has BlockInfo[]. This is again 16 bytes of object, 8 bytes length, say two blocks in a file with two references, with a cost of 40 bytes. # It has long header that adds another 8 bytes. Total ~160 bytes. So it is not very far off and the number I had posted was based on what I had calculated long back. That said, 16-24 might seem like a huge percentage (10 to 15%) of INode size. But what is the amount of memory in NN heap that is allocate for Inodes. Assuming Inodes make up for 1/3, blocks make up for another 1/3, remaining 1/3 for floating garbage, head room etc, the net impact on NN heap is 3 to 5%. That is not far off from the analysis posted above. I believe half of the work is already in trunk. Remaining two jiras need to go in. I believe doing a branch at this point in time is unnecessary work. If you are concerned about memory usage of your installs, I can add a config option and not instantiate the map. Use InodeID as as an identifier of a file in HDFS protocols and APIs Key: HDFS-4489 URL: https://issues.apache.org/jira/browse/HDFS-4489 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Brandon Li Assignee: Brandon Li The benefit of using InodeID to uniquely identify a file can be multiple folds. Here are a few of them: 1. uniquely identify a file cross rename, related JIRAs include HDFS-4258, HDFS-4437. 2. modification checks in tools like distcp. Since a file could have been replaced or renamed to, the file name and size combination is no t reliable, but the combination of file id and size is unique. 3. id based protocol support (e.g., NFS) 4. to make the pluggable block placement policy use fileid instead of filename (HDFS-385). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs
[ https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13614802#comment-13614802 ] Suresh Srinivas commented on HDFS-4489: --- One more thing I forgot to add. There are many optimizations that are possible to reduce the memory consumed. It comes at the cost of code complexity and not so clean abstractions. I would rather avoid it and go for additional memory given newer machines are coming with more memory, than make the code unclean. Use InodeID as as an identifier of a file in HDFS protocols and APIs Key: HDFS-4489 URL: https://issues.apache.org/jira/browse/HDFS-4489 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Brandon Li Assignee: Brandon Li The benefit of using InodeID to uniquely identify a file can be multiple folds. Here are a few of them: 1. uniquely identify a file cross rename, related JIRAs include HDFS-4258, HDFS-4437. 2. modification checks in tools like distcp. Since a file could have been replaced or renamed to, the file name and size combination is no t reliable, but the combination of file id and size is unique. 3. id based protocol support (e.g., NFS) 4. to make the pluggable block placement policy use fileid instead of filename (HDFS-385). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs
[ https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13613236#comment-13613236 ] Suresh Srinivas commented on HDFS-4489: --- Given that the subtasks do not break the trunk, I plan to start reviewing individual jiras and committing the patches attached to subtasks. Some of these patches are already committed to trunk. Use InodeID as as an identifier of a file in HDFS protocols and APIs Key: HDFS-4489 URL: https://issues.apache.org/jira/browse/HDFS-4489 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Brandon Li Assignee: Brandon Li The benefit of using InodeID to uniquely identify a file can be multiple folds. Here are a few of them: 1. uniquely identify a file cross rename, related JIRAs include HDFS-4258, HDFS-4437. 2. modification checks in tools like distcp. Since a file could have been replaced or renamed to, the file name and size combination is no t reliable, but the combination of file id and size is unique. 3. id based protocol support (e.g., NFS) 4. to make the pluggable block placement policy use fileid instead of filename (HDFS-385). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs
[ https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13613240#comment-13613240 ] Aaron T. Myers commented on HDFS-4489: -- Why not do this on a branch? That makes the most sense to me, given that the individual patches themselves don't make a lot of sense when considered individually. Use InodeID as as an identifier of a file in HDFS protocols and APIs Key: HDFS-4489 URL: https://issues.apache.org/jira/browse/HDFS-4489 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Brandon Li Assignee: Brandon Li The benefit of using InodeID to uniquely identify a file can be multiple folds. Here are a few of them: 1. uniquely identify a file cross rename, related JIRAs include HDFS-4258, HDFS-4437. 2. modification checks in tools like distcp. Since a file could have been replaced or renamed to, the file name and size combination is no t reliable, but the combination of file id and size is unique. 3. id based protocol support (e.g., NFS) 4. to make the pluggable block placement policy use fileid instead of filename (HDFS-385). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs
[ https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13613261#comment-13613261 ] Suresh Srinivas commented on HDFS-4489: --- [~atm] Can you describe which of the individual patches do not make sense to you? I thought the previous comments indicated that it was not clear how the overall design is and how the pieces fit together. Now that this jiras describes the over all motivation, approach being taken, I hope there is more clarity. Use InodeID as as an identifier of a file in HDFS protocols and APIs Key: HDFS-4489 URL: https://issues.apache.org/jira/browse/HDFS-4489 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Brandon Li Assignee: Brandon Li The benefit of using InodeID to uniquely identify a file can be multiple folds. Here are a few of them: 1. uniquely identify a file cross rename, related JIRAs include HDFS-4258, HDFS-4437. 2. modification checks in tools like distcp. Since a file could have been replaced or renamed to, the file name and size combination is no t reliable, but the combination of file id and size is unique. 3. id based protocol support (e.g., NFS) 4. to make the pluggable block placement policy use fileid instead of filename (HDFS-385). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs
[ https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13613269#comment-13613269 ] Aaron T. Myers commented on HDFS-4489: -- Sorry if I wasn't clear - all the patches make sense to me, it's just that several of them don't really stand on their own, so it seems like we should work on the whole work on a separate branch, get the feature in shape there, and then merge it back to trunk once the whole project is completed. Use InodeID as as an identifier of a file in HDFS protocols and APIs Key: HDFS-4489 URL: https://issues.apache.org/jira/browse/HDFS-4489 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Brandon Li Assignee: Brandon Li The benefit of using InodeID to uniquely identify a file can be multiple folds. Here are a few of them: 1. uniquely identify a file cross rename, related JIRAs include HDFS-4258, HDFS-4437. 2. modification checks in tools like distcp. Since a file could have been replaced or renamed to, the file name and size combination is no t reliable, but the combination of file id and size is unique. 3. id based protocol support (e.g., NFS) 4. to make the pluggable block placement policy use fileid instead of filename (HDFS-385). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs
[ https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13613293#comment-13613293 ] Suresh Srinivas commented on HDFS-4489: --- bq. all the patches make sense to me, it's just that several of them don't really stand on their own I am not sure I understand this. One of the main reasons for a feature branch (at least for me), while during development, we may break trunk. But in this case that is not the case. I have cleaned up the list of subtasks in this jira. Hopefully the subtasks should make it more clear. Let me add some details about individual jiras and that should help in understanding them better: # HDFS-4334 - Adds unique ID to each INode. # HDFS-4346 - Refactored the code to remove code duplication between INode generation and block ID generation # HDFS-4339 - Persist the INode in fsimage. # HDFS-4434 - Introduce a map of inode ID to inode so that inodeid/fileid can be used as an identifier to address a file Use InodeID as as an identifier of a file in HDFS protocols and APIs Key: HDFS-4489 URL: https://issues.apache.org/jira/browse/HDFS-4489 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Brandon Li Assignee: Brandon Li The benefit of using InodeID to uniquely identify a file can be multiple folds. Here are a few of them: 1. uniquely identify a file cross rename, related JIRAs include HDFS-4258, HDFS-4437. 2. modification checks in tools like distcp. Since a file could have been replaced or renamed to, the file name and size combination is no t reliable, but the combination of file id and size is unique. 3. id based protocol support (e.g., NFS) 4. to make the pluggable block placement policy use fileid instead of filename (HDFS-385). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs
[ https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13613338#comment-13613338 ] Suresh Srinivas commented on HDFS-4489: --- bq. At one point in time I believe the INodeID work did indeed break WebHDFS on trunk. That is because of a bug introduced in the change. When I say *break* in my previous comment, it is breaking the functionality because incomplete set of changes where HDFS is not functional and not a bug in the code committed. bq. I don't understand the resistance to doing this on a feature branch. What's the concern with doing so? I am not resisting it, I do not see a need for it. I believe we have two more jiras to go in. Other jiras are already in. I think moving those commits to a separate branch, adding mere two more commits in that branch, calling for merge vote is unnecessary waste of time and I want to avoid it. Use InodeID as as an identifier of a file in HDFS protocols and APIs Key: HDFS-4489 URL: https://issues.apache.org/jira/browse/HDFS-4489 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Brandon Li Assignee: Brandon Li The benefit of using InodeID to uniquely identify a file can be multiple folds. Here are a few of them: 1. uniquely identify a file cross rename, related JIRAs include HDFS-4258, HDFS-4437. 2. modification checks in tools like distcp. Since a file could have been replaced or renamed to, the file name and size combination is no t reliable, but the combination of file id and size is unique. 3. id based protocol support (e.g., NFS) 4. to make the pluggable block placement policy use fileid instead of filename (HDFS-385). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs
[ https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13613300#comment-13613300 ] Aaron T. Myers commented on HDFS-4489: -- bq. One of the main reasons for a feature branch (at least for me), while during development, we may break trunk. But in this case that is not the case. At one point in time I believe the INodeID work did indeed break WebHDFS on trunk. Another reason for using a development branch is because the feature isn't necessarily complete without certain patches having been committed. The fact that HDFS-4339 (persist INodeIDs in the fsimage) isn't committed yet suggests that this feature won't really work as-intended until that's committed, but yet we've already committed other patches involving INodeIDs to trunk. That doesn't make much sense to me. I don't understand the resistance to doing this on a feature branch. What's the concern with doing so? Use InodeID as as an identifier of a file in HDFS protocols and APIs Key: HDFS-4489 URL: https://issues.apache.org/jira/browse/HDFS-4489 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Brandon Li Assignee: Brandon Li The benefit of using InodeID to uniquely identify a file can be multiple folds. Here are a few of them: 1. uniquely identify a file cross rename, related JIRAs include HDFS-4258, HDFS-4437. 2. modification checks in tools like distcp. Since a file could have been replaced or renamed to, the file name and size combination is no t reliable, but the combination of file id and size is unique. 3. id based protocol support (e.g., NFS) 4. to make the pluggable block placement policy use fileid instead of filename (HDFS-385). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs
[ https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13613343#comment-13613343 ] Aaron T. Myers commented on HDFS-4489: -- bq. I am not resisting it, I do not see a need for it. I believe we have two more jiras to go in. Other jiras are already in. I think moving those commits to a separate branch, adding mere two more commits in that branch, calling for merge vote is unnecessary waste of time and I want to avoid it. Alright, go for it. I'll repeat my claim that this work should have been done on a branch to begin with, but c'est la vie. Use InodeID as as an identifier of a file in HDFS protocols and APIs Key: HDFS-4489 URL: https://issues.apache.org/jira/browse/HDFS-4489 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Brandon Li Assignee: Brandon Li The benefit of using InodeID to uniquely identify a file can be multiple folds. Here are a few of them: 1. uniquely identify a file cross rename, related JIRAs include HDFS-4258, HDFS-4437. 2. modification checks in tools like distcp. Since a file could have been replaced or renamed to, the file name and size combination is no t reliable, but the combination of file id and size is unique. 3. id based protocol support (e.g., NFS) 4. to make the pluggable block placement policy use fileid instead of filename (HDFS-385). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs
[ https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13603642#comment-13603642 ] Suresh Srinivas commented on HDFS-4489: --- h2. Introduction This jira proposes to introduce InodeID in HDFS. InodeID uniquely identifies an instance of an inode such as directory, file, symbolic links etc. independent of the file path name. This helps in several use cases: # HDFS can evolve to support ID based protocols such as NFS. We plan to add an experimental NFS V3 gateway to HDFS using this mechanism. Will post a github link soon. # InodeID can be used by the tools to track a single instance of a file, for cacheing data or tracking and checking for modification based on INodeID, in tools like distcp. # Path cannot identify a unique instance of a file. This causes issues as described in HDFS-4258 and HDFS-4437. It has also been a requirement of many other jiras such as HDFS-385. # Using InodeID as an identifier instead of path can be more efficient than path bases accesses. h4. New Inode identifier It is 64 bit long number and has the following properties: # Number 0 is reserved and would be used for identifying invalid/default value. # Numbers 1-1023 are reserved for some unforeseen future requirements. The InodeID starts from 1024. # An InodeID is never re-used in a single namenode namespace. h3. General overview of the changes required Applications discover the InodeID by getting the FileStatus for an Inode or when an Inode is created or opened for append. FileStatus will be changed to include InodeID. Create and append will be changed to return FileStatus as well. For other APIs that use path (Path or String) to identify a file we have two choices: # *API Change*: Add another variant of the API that uses InodeID to identify a file or add additional parameter InodeID to the API. # *No API Change*: Use the path to send the InodeID and keep the API changes to a minimum. Example, one could use path {{/.inodes/inodesID}}, where .inodes is a reserved name to identify the path that pass InodeID instead of regular path. This similar to /proc used for special purposes on *nix. h4. InodeID to Inode map A new map (based on GSet) will be introduced in the namenode to map a given InodeID to an Inode. This is in addition to existing map of BlockID to BlockInfo. h4. Additional memory consumption Adding all this will require additional memory in the namenode. * 8 byte InodeID into Inode object results in a cost of 8 bytes per Inode. As proposed in HDFS-4258, this can be folded into existing modification and access time. * Introducing InodeID to Inode GSet results in additional memory of 16 bytes per Inode: ** 8 * size of Gset (where size of GSet could be as big as number of Inodes) ** 8 bytes per Inode for a java reference (pointer to next element as required by GSet) Inodes and related objects consume approximately 1/3 of the JVM heap in a system that is full. Inode size is ~180 bytes and this proposal adds 16-24 bytes per Inode. With change the JVM heap needs to be increased by 3% to 4.5%. While further optimizations are possible to reduce this size further, it adds needless code complexity. h4. Security concerns InodeID is an alternate name similar to path. All the existing security mechanism that applies to path (that is ensuring permissions are checked from the root to the Inode) will also be done for InodeID based access. Use InodeID as as an identifier of a file in HDFS protocols and APIs Key: HDFS-4489 URL: https://issues.apache.org/jira/browse/HDFS-4489 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Brandon Li Assignee: Brandon Li The benefit of using InodeID to uniquely identify a file can be multiple folds. Here are a few of them: 1. uniquely identify a file cross rename, related JIRAs include HDFS-4258, HDFS-4437. 2. modification checks in tools like distcp. Since a file could have been replaced or renamed to, the file name and size combination is no t reliable, but the combination of file id and size is unique. 3. id based protocol support (e.g., NFS) 4. to make the pluggable block placement policy use fileid instead of filename (HDFS-385). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira