[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-05-02 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13647345#comment-13647345
 ] 

Suresh Srinivas commented on HDFS-4489:
---

I have merged HDFS-4434 back into branch-2.

 Use InodeID as as an identifier of a file in HDFS protocols and APIs
 

 Key: HDFS-4489
 URL: https://issues.apache.org/jira/browse/HDFS-4489
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Reporter: Brandon Li
Assignee: Brandon Li
 Fix For: 2.0.5-beta

 Attachments: 4434.optimized.patch


 Using an InodeID to uniquely identify a file has several benefits. Here are 
 a few of them:
 1. uniquely identify a file across renames; related JIRAs include HDFS-4258 
 and HDFS-4437.
 2. modification checks in tools like distcp. Since a file could have been 
 replaced or renamed over, the file name and size combination is not reliable, 
 but the combination of file id and size is unique.
 3. id-based protocol support (e.g., NFS)
 4. to make the pluggable block placement policy use fileid instead of 
 filename (HDFS-385).
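
A minimal sketch of point 2, assuming a hypothetical `FileStatus` record with `path`, `file_id`, and `size` fields (illustrative names, not the HDFS API). It models a file replaced by a different file of the same name and size: a (name, size) comparison misses the change, while a (file id, size) comparison catches it and also still recognizes a file that was merely renamed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStatus:
    """Hypothetical listing entry; field names are illustrative only."""
    path: str
    file_id: int  # stand-in for the inode ID
    size: int


def unchanged_by_name(src: FileStatus, dst: FileStatus) -> bool:
    # Name-based check: fooled when a different file takes over the same path.
    return src.path == dst.path and src.size == dst.size


def unchanged_by_id(src: FileStatus, dst: FileStatus) -> bool:
    # ID-based check: survives renames and detects replacement.
    return src.file_id == dst.file_id and src.size == dst.size


before = FileStatus("/data/part-0", file_id=1001, size=4096)
# A different file of the same size was renamed over the original path.
after_replace = FileStatus("/data/part-0", file_id=2002, size=4096)
# The original file was merely renamed.
after_rename = FileStatus("/archive/part-0", file_id=1001, size=4096)

print(unchanged_by_name(before, after_replace), unchanged_by_id(before, after_replace))
print(unchanged_by_name(before, after_rename), unchanged_by_id(before, after_rename))
```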

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-05-01 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13646745#comment-13646745
 ] 

Suresh Srinivas commented on HDFS-4489:
---

Thanks [~nroberts] and [~daryn] for commenting back.

bq. 100K files via 100 threads seems like a very small sampling when we're 
running namespaces well over 100M. I think the only detail that might make 
performance worse is how well the inode map performs as the bucket chains get 
longer. If it's a problem we can probably fix it later.
100 threads is a considerable number and matches the RPC handler count of a 
typical big cluster. Also, the inodeMap is sized as a percentage of total 
memory, which means it scales with the size of the namenode. I agree that this 
performance impact should be minimal, and we should be able to fix it if we 
find any issues.
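
As a rough illustration of the sizing argument, the sketch below derives a hash-table capacity from a percentage of the heap and estimates the average bucket chain length for a given number of inodes. The 8-byte reference size, the 1% figure, and the function names are assumptions made for this example, not values taken from the patch.

```python
def map_capacity(heap_bytes: int, percent: float, ref_bytes: int = 8) -> int:
    """Largest power-of-two bucket count whose bucket array fits in
    `percent` of the heap (ref_bytes per bucket reference is assumed)."""
    budget = int(heap_bytes * percent / 100)
    buckets = max(budget // ref_bytes, 1)
    return 1 << (buckets.bit_length() - 1)  # round down to a power of two


def avg_chain_length(num_inodes: int, capacity: int) -> float:
    # With a uniform hash, the expected entries per bucket is the load factor.
    return num_inodes / capacity


GIB = 1 << 30
small = map_capacity(4 * GIB, percent=1.0)    # small test namenode
large = map_capacity(64 * GIB, percent=1.0)   # large production-sized heap

print(small, avg_chain_length(100_000, small))
print(large, avg_chain_length(150_000_000, large))
```

Because the capacity grows with the heap, the chain length for a 150M-inode namespace on a large heap stays at a few entries per bucket under these assumptions.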

bq. I did notice that unprotectedConcat appears to leak inodes in the map - it 
unlinks the concat'ed files but doesn't remove them from the map. 
Nice catch. Created HDFS-4785.

bq. ...so you may want to double check.
Yes. I will run through one more review.

bq. Might want to correct the misspelling: remvoed AllFromInodesFromMap
Will be addressed in another jira.




[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-04-30 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13645624#comment-13645624
 ] 

Suresh Srinivas commented on HDFS-4489:
---

[~shv] [~nroberts] [~daryn], please let me know if your concerns are addressed. 
If I do not hear back, I plan to commit this change today or tomorrow.



[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-04-30 Thread Nathan Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13645635#comment-13645635
 ] 

Nathan Roberts commented on HDFS-4489:
--

Thanks for running some basic performance tests! Looks like minimal impact.



[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-04-30 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13645924#comment-13645924
 ] 

Daryn Sharp commented on HDFS-4489:
---

100K files via 100 threads seems like a very small sampling when we're running 
namespaces well over 100M.  I think the only detail that might make performance 
worse is how well the inode map performs as the bucket chains get longer.  If 
it's a problem we can probably fix it later.

I did notice that unprotectedConcat appears to leak inodes in the map - it 
unlinks the concat'ed files but doesn't remove them from the map.  The business 
logic for keeping the inode map in sync with the namespace is high enough up 
the call stack that it makes it a bit tough to prove all delete paths are safe, 
so you may want to double check.
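
The leak described above can be pictured with a toy model. This is only a sketch: the `Namespace` class and its method names are hypothetical and do not mirror the actual FSDirectory code, and the merging of blocks into the target file is omitted.

```python
class Namespace:
    """Toy model: a directory tree (path -> inode id) plus an id -> path map."""

    def __init__(self):
        self.tree = {}       # path -> inode id
        self.inode_map = {}  # inode id -> path
        self._next_id = 1

    def create(self, path):
        inode_id = self._next_id
        self._next_id += 1
        self.tree[path] = inode_id
        self.inode_map[inode_id] = path
        return inode_id

    def concat_leaky(self, target, sources):
        # Unlinks the sources from the tree but forgets the inode map.
        for src in sources:
            del self.tree[src]

    def concat_fixed(self, target, sources):
        # Unlinks the sources and removes their map entries as well.
        for src in sources:
            inode_id = self.tree.pop(src)
            del self.inode_map[inode_id]

    def leaked(self):
        # Map entries whose inode is no longer reachable from the tree.
        return set(self.inode_map) - set(self.tree.values())


ns = Namespace()
ns.create("/a")
b = ns.create("/b")
ns.concat_leaky("/a", ["/b"])
print(ns.leaked())  # contains the id of the unlinked /b
```

A check like `leaked()` run after each delete path is one way to gain confidence that the map stays in sync with the namespace.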



[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-04-30 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13645932#comment-13645932
 ] 

Daryn Sharp commented on HDFS-4489:
---

Might want to correct the misspelling: *remvoed* AllFromInodesFromMap :)



[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-04-29 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13644880#comment-13644880
 ] 

Suresh Srinivas commented on HDFS-4489:
---

Here are the NNBench results for delete operations (run with 100 threads 
running simultaneously):
||Operations||Elapsed||OpsPerSec||AvgTime||
|100000|19243|5196.694902|19|
|100000|18598|5376.92225|18|
|100000|17819|5611.987205|17|
|100000|17953|5570.099705|17|
|100000|18077|5531.891354|18|
|100000|17948|5571.651437|17|
|100000|18080|5530.973451|18|
|100000|18032|5545.696539|18|
|100000|18431|5425.641582|18|
|100000|17735|5638.567804|17|
|Average|18191.6|5500.012623|17.7|

||Operations||Elapsed||OpsPerSec||AvgTime||
|100000|18029|5546.619336|17|
|100000|18527|5397.527932|18|
|100000|18164|5505.395287|18|
|100000|18486|5409.49908|18|
|100000|18053|5539.24|18|
|100000|18313|5460.601758|18|
|100000|18299|5464.779496|18|
|100000|17878|5593.466831|17|
|100000|18178|5501.155243|18|
|100000|18084|5529.750055|18|
|Average|18201.1|5494.804057|17.8|





[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-04-29 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13644898#comment-13644898
 ] 

Suresh Srinivas commented on HDFS-4489:
---

Summary of results in the tests:
# File create tests - perform additional reserved name processing, inode map 
addition and reserved name check. This is where the maximum additional work 
from the patch is being done.
#* In the micro benchmark that just calls create file related methods, the time 
went from 19235.8 to 19789.2, roughly 2.9% higher. This can be further 
reduced to 1.3% by turning off the map. The patch moves splitting paths into 
components outside the lock. Based on this, further optimizations are possible 
that improve throughput by reducing the synchronized sections. The end result 
with those optimizations can make running times much smaller than what they are 
today.
#* I would also point out that this is a micro benchmark. The % difference 
observed in this will be dwarfed by RPC times, network round trip time etc. 
Also the system will spend time on other operations which should not be 
affected by this patch.
# File delete tests - perform reserved name processing and only inode map 
deletion.
#* There is very little difference in the benchmark results.
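
The percentages for the create tests can be rechecked from the average elapsed times in the NNThroughputBenchmark results posted on 2013-04-27. The helper name below is just for illustration.

```python
def overhead_percent(baseline_ms: float, measured_ms: float) -> float:
    """Relative increase of measured over baseline, in percent."""
    return (measured_ms - baseline_ms) / baseline_ms * 100


baseline = 19235.8     # average elapsed, without the patch
with_patch = 19789.2   # average elapsed, with the patch
map_off = 19491.1      # average elapsed, patch with INodeMap turned off

print(f"with patch: {overhead_percent(baseline, with_patch):.2f}%")
print(f"map off:    {overhead_percent(baseline, map_off):.2f}%")
```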



[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-04-29 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13644907#comment-13644907
 ] 

Suresh Srinivas commented on HDFS-4489:
---

Given the above tests, here are all the issues that were brought up:
# Introducing incompatible change
#* This is not a major incompatibility. As I said earlier, creating a file or 
directory named /.reserved is not allowed. That said, this should get into 
2.0.5 given that its main goal is compatibility.
# This patch could be destabilizing
#* This patch adds an inode map and support for a path scheme which allows 
addressing files by inodes. Most of the code added in this patch is to support 
the new addressing mechanisms and the extensive unit tests associated with 
them. The regular code path should largely be unaffected by this, with the 
exception of adding and deleting entries in the inode map. Please bring up any 
concerns that I might have overlooked.
# Performance impact - based on the results, there is very little performance 
impact. I have two options:
#* Accept it as is. The difference observed in microbenchmarks amounts to a 
much smaller difference in a real system, and it affects only a few write 
operations such as create.
#* Make further optimizations to reduce synchronized section size based on the 
mechanism added in this patch. [~nroberts] if you feel this is important, I 
will undertake the work of optimizing this. [~daryn] also had expressed 
interest in it. Not sure if he has the bandwidth.

Given this, I would like to merge this into branch-2.0.5. I hope the concerns 
expressed by people are addressed.
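
To make the path scheme concrete, here is a small sketch of how an inode-addressed path might be translated back to a regular path. The `.inodes` component, the function names, and rejecting everything under /.reserved (rather than only /.reserved itself) are assumptions made for the example; only the reserved /.reserved name comes from the discussion above.

```python
from typing import Dict

RESERVED = ".reserved"
INODES = ".inodes"  # assumed name of the inode-addressing component


def resolve(path: str, inode_map: Dict[int, str]) -> str:
    """Translate /.reserved/.inodes/<id>[/rest] into a regular path.

    Regular paths are returned unchanged, so the common code path only
    pays for checking the leading components.
    """
    parts = [p for p in path.split("/") if p]
    if len(parts) >= 3 and parts[0] == RESERVED and parts[1] == INODES:
        base = inode_map[int(parts[2])]  # KeyError if the inode is unknown
        return "/".join([base.rstrip("/")] + parts[3:]) or "/"
    return path


def check_create(path: str) -> None:
    # In this sketch, creating /.reserved or anything under it is rejected.
    parts = [p for p in path.split("/") if p]
    if parts and parts[0] == RESERVED:
        raise ValueError(f"{path!r} is reserved")


inode_map = {16385: "/user/alice", 16386: "/user/alice/report.txt"}
print(resolve("/.reserved/.inodes/16386", inode_map))
print(resolve("/.reserved/.inodes/16385/report.txt", inode_map))
print(resolve("/tmp/x", inode_map))
```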



[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-04-29 Thread Tsz Wo (Nicholas), SZE (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13645090#comment-13645090
 ] 

Tsz Wo (Nicholas), SZE commented on HDFS-4489:
--

The performance numbers look good.  Since the rpc time is not counted, a small 
percentage difference is nothing.  Besides, the Inode ID feature is very useful. 
It also helps in implementing the Snapshot feature.

+1 on merging it to branch-2.



[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-04-27 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13643745#comment-13643745
 ] 

Suresh Srinivas commented on HDFS-4489:
---

I ran Slive tests. Even with very small amounts of data written, I could not 
find a perceptible difference between the test runs, since any additional time 
in NN methods is dwarfed by the overall time of calling the NN over RPC etc.

So I decided to run NNThroughputBenchmark. For folks new to it, it is a micro 
benchmark that does not use RPC and directly executes operations on the 
namenode class. Hence it gives comparisons sharply limited to NN method calls 
alone. I ran NNThroughputBenchmark to create 100K files using 100 threads in 
each iteration, using the command below:
{noformat}
bin/hadoop jar share/hadoop/hdfs/hadoop-hdfs-2.0.5-SNAPSHOT-tests.jar 
org.apache.hadoop.hdfs.server.namenode.NNThroughputBenchmark -op create 
-threads 100 -files 100000 -filesPerDir 100 
{noformat}

*Without this patch:*
||Operations||Elapsed||OpsPerSec||AvgTime||
|100000| 20327| 4919.565110444237| 20|
|100000| 19199| 5208.604614823688| 19|
|100000| 19287| 5184.839529216571| 19|
|100000| 19128| 5227.9381012128815| 19|
|100000| 19082| 5240.540823813018| 19|
|100000| 18785| 5323.396326856535| 18|
|100000| 18947| 5277.880403230063| 18|
|100000| 18963| 5273.427200337499| 18|
|100000| 19206| 5206.706237634073| 19|
|100000| 19434| 5145.621076463929| 19|
|Average|19235.8|5200.851942|18.8|

*With this patch:*
||Operations||Elapsed||OpsPerSec||AvgTime||
|100000| 20104| 4974.134500596896| 19|
|100000| 19498| 5128.731151913017| 19|
|100000| 19449| 5141.652527122217| 19|
|100000| 19530| 5120.327700972863| 19|
|100000| 20067| 4983.305925150745| 19|
|100000| 19703| 5075.369233111709| 19|
|100000| 19595| 5103.342689461598| 19|
|100000| 19418| 5149.860953754249| 19|
|100000| 19932| 5017.057997190447| 19|
|100000| 20596| 4855.311711011847| 20|
|Average|19789.2|5054.909439|19.1|

*With this patch + an additional change to turn off INodeMap:*
||Operations||Elapsed||OpsPerSec||AvgTime||
|100000| 19615| 5098.139179199592| 19|
|100000| 19349| 5168.225748100677| 19|
|100000| 19136| 5225.752508361204| 19|
|100000| 19347| 5168.760014472528| 19|
|100000| 20096| 4976.114649681529| 19|
|100000| 19248| 5195.344970906068| 19|
|100000| 18916| 5286.529921759357| 18|
|100000| 19217| 5203.7258677212885| 19|
|100000| 20105| 4973.887092762994| 20|
|100000| 19882| 5029.675082989639| 19|
|Average|19491.1|5132.615504|19|
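
The Average rows can be re-derived from the per-run elapsed times. The sketch below assumes that Elapsed is in milliseconds and that each run performs 100K operations (the file count stated in the text above), which is consistent with the OpsPerSec column.

```python
from statistics import mean

OPS = 100_000  # files created per iteration (100K, per the text above)

# Elapsed times, assumed to be milliseconds, copied from the three tables.
without_patch = [20327, 19199, 19287, 19128, 19082, 18785, 18947, 18963, 19206, 19434]
with_patch = [20104, 19498, 19449, 19530, 20067, 19703, 19595, 19418, 19932, 20596]
map_off = [19615, 19349, 19136, 19347, 20096, 19248, 18916, 19217, 20105, 19882]


def ops_per_sec(elapsed_ms: int, ops: int = OPS) -> float:
    # Throughput for one run: operations divided by elapsed seconds.
    return ops / (elapsed_ms / 1000)


for name, runs in [("without", without_patch), ("with", with_patch), ("map off", map_off)]:
    avg_elapsed = mean(runs)
    avg_throughput = mean(ops_per_sec(r) for r in runs)
    print(f"{name:8s} elapsed={avg_elapsed:.1f} ops/sec={avg_throughput:.3f}")
```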




[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-04-27 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13643839#comment-13643839
 ] 

Suresh Srinivas commented on HDFS-4489:
---

I changed the code to reuse the byte[][] pathComponents for file creation 
(with some optimizations in that method; there are other optimizations 
available in terms of permission checks that I did not venture to do). The 
throughput with those partial optimizations is:
||Operations||Elapsed||OpsPerSec||AvgTime||
|100000| 19591| 5104.384666428462| 19|
|100000| 18969| 5271.759186040382| 18|
|100000| 19206| 5206.706237634073| 19|
|100000| 18652| 5361.35535063264| 18|
|100000| 19218| 5203.455094182537| 19|
|100000| 19179| 5214.036185411127| 19|
|100000| 19302| 5180.810278727593| 19|
|100000| 19388| 5157.829585310501| 19|
|100000| 19099| 5235.876223886067| 19|
|100000| 19591| 5104.384666428462| 19|
|Average|19219.5|5204.059747|18.8|
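
The idea of splitting the path once and reusing the components can be sketched as follows. This is a toy model with hypothetical names (`namespace_lock`, `create_file`), not the namenode code: the string work happens before the lock is taken, so the critical section only does the lookup and insert.

```python
import threading

namespace_lock = threading.Lock()
namespace = {}  # tuple of path components -> True (toy directory tree)


def split_path(path: str) -> tuple:
    # Pure string work: safe to do before taking the lock.
    return tuple(c.encode("utf-8") for c in path.split("/") if c)


def create_file(path: str) -> bool:
    components = split_path(path)       # done once, outside the lock
    with namespace_lock:                # critical section stays small
        if components in namespace:
            return False
        namespace[components] = True    # components reused, not re-split
        return True


threads = [threading.Thread(target=create_file, args=(f"/bench/dir{i % 10}/file{i}",))
           for i in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(namespace))
```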




[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-04-26 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13642887#comment-13642887
 ] 

Daryn Sharp commented on HDFS-4489:
---

I don't think Nathan and I are questioning the utility of the feature, but we 
need to get a feel for the possible performance impact.  _If_ there is a 
significant degradation then it will delay our adoption of 2.x until it's 
optimized.

I think a good performance test is to create a namespace of 150M paths.  Flood 
the NN with thousands of concurrent file and directory adds/deletes per second 
throughout the namespace.  Hopefully there is an existing benchmark with those 
properties.



[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-04-26 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13642895#comment-13642895
 ] 

Suresh Srinivas commented on HDFS-4489:
---

bq. I think a good performance test is to create a namespace of 150M paths. 
Flood the NN with thousands of concurrent file and directory adds/deletes per 
second throughout the namespace. Hopefully there is an existing benchmark with 
those properties.
I think we are talking about hashmap entry addition and deletion during adds 
and deletes of files, other than increased memory. I am not sure I understand 
the cache pollution part of the performance impact, given that namenode core 
objects run into GBs in a large setup.

I am currently running some slive tests. But I do not currently have the 
bandwidth to set up a namenode with 150M paths (that would require more than 
64GB of JVM heap). Do you have some bandwidth to do these tests? 



[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-04-26 Thread Konstantin Shvachko (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13642907#comment-13642907
 ] 

Konstantin Shvachko commented on HDFS-4489:
---

Suresh,
0.20 is not a typo. You should parse it as sarcasm, sorry. Wire compatibility 
was a target for many previous releases and the train is still there.
We clearly have a disagreement about what should be in the release. Other 
people may have other opinions. And that is my point.
All I ask is to play by the rules. Make a release plan and put it to a vote. 
See the bylaws under Release Plan. I'll be glad to discuss your plan.
Here you act like it's your own branch where you commit what you want and 
nobody else cares.
Does it make sense?



[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-04-26 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13642996#comment-13642996
 ] 

Suresh Srinivas commented on HDFS-4489:
---

bq. Here you act like it's your own branch where you commit what you want and 
nobody else cares.
I fail to understand the need for such a hostile tone. That said, please look 
at the many small features, improvements and numerous bug fixes that are 
committed by me and other committers. Also, instead of stating your objection 
to a change as "it is big", "150K lines of code" etc., it would be great if you 
could really look at the patch and express more concrete technical concerns 
related to stability.

I have reverted HDFS-4434. I have also responded on the thread related to 2.0.5 
about including the features that many have been working on for many months.

It seems to me that suddenly, in the past week or so, you have decided that 
stability is the only paramount thing, disregarding all the discussions that 
have happened. Please see my earlier comment on the discussion related to API 
and wire protocol stability that we sent months ago.


[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-04-26 Thread Nathan Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13643025#comment-13643025
 ] 

Nathan Roberts commented on HDFS-4489:
--

bq. Suresh is willing to do the performance benchmark, but I am trying to 
understand where you are coming from. Yahoo and FB create very large namespaces 
by simply buying more memory and increasing the size of the heap. 

This is not always possible. Some of our namenodes are running at the maximum 
configuration for the box (maximum memory, maximum heap, near maximum 
namespace). For these clusters, upgrading to this feature will require new 
boxes. 

bq. Do you worry about cache pollution when you create 50K more files? 
I don't worry about cache pollution when I create 50K more files. What's 
important is the size of the working set. Inodes are a very popular object 
within the NN; if inodes make up a significant part of our working set, then it 
matters. I don't know whether this is the case or not, which is why I think it 
makes sense to run some benchmarks to make sure we don't see any ill effects. 
With the introduction of YARN, the central RM is rarely the bottleneck. Now 
it's much more common for the NN to be the bottleneck of the cluster, and 
slowing down the bottleneck always needs to be looked at carefully.

bq. Given that the NN heap (many GBs) is so much larger than the cache, does 
the additional inode and inode-map size impact the overall system performance? 
Good question. Let's find out.

bq. Suresh has argued that a 24GB heap grows by 625MB. 
I was using the numbers Todd gathered where a 7G heap grew by 600MB. When we 
looked at one of our key clusters, we calculated something like 7.5% increase.
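
For reference, the relative growth implied by the two figures quoted above can be worked out directly. This is a minimal sketch of the arithmetic only, assuming 1GB = 1024MB; the class and method names are hypothetical, and the inputs are just the numbers mentioned in this thread, not new measurements.

```java
// Illustrative arithmetic only; class and method names are hypothetical.
final class HeapGrowthSketch {
    // Assumption: binary units, 1GB = 1024MB.
    static final double MB_PER_GB = 1024.0;

    /** Returns the growth as a percentage of the baseline heap size. */
    static double growthPercent(double growthMb, double baselineHeapGb) {
        if (baselineHeapGb <= 0) {
            throw new IllegalArgumentException("baseline heap must be positive");
        }
        return 100.0 * growthMb / (baselineHeapGb * MB_PER_GB);
    }
}
```

Calling growthPercent(625, 24) and growthPercent(600, 7) gives the percentage for each quoted case. The 7.5% figure for the key cluster came from a separate calculation and is not reproduced here.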

bq. Looking at the growth in memory of this feature as a percentage of the 
total heap size is a more realistic way of looking at the impact of the growth 
than the growth of an individual data structure like the inode.
Maybe.   


bq. IMHO, not having an inode-map and inode number was a serious limitation in 
the original implementation of NN. I am willing to pay for the extra memory 
given the value inode-id and inode-map brings (as described by suresh in the 
beginning of this Jira). Permissions, access time, etc added to the memory cost 
of the NN and were accepted because of the value they bring. 
Certainly agree it is a limitation. We just need to make sure we fully quantify 
all of the costs.  



[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-04-26 Thread Konstantin Shvachko (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13643404#comment-13643404
 ] 

Konstantin Shvachko commented on HDFS-4489:
---

> hostile tone.

I apologize.
I guess what I really wanted to say is that it is hostile to commit incompatible 
changes in a stabilization branch before the release plan is proposed.

> would be great if you can really look at the patch

You know I did.
Thanks for responding on the thread related to 2.0.5. I understand the plan 
much better.
I appreciate your reverting HDFS-4434.

There is still an incompatible change, HDFS-4296. It is listed in new features 
for some reason.
Do you still need HDFS-4296 once HDFS-4434 is reverted?
We have not changed LayoutVersion since branch 0.23.


[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-04-26 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13643424#comment-13643424
 ] 

Suresh Srinivas commented on HDFS-4489:
---

bq. There is still an incompatible change HDFS-4296. It is listed in new 
features for some reason.
It is not incompatible and hence is not marked as incompatible in jira or 
CHANGES.txt. It is currently listed as New Feature in CHANGES.txt. I do not 
think it should be listed under the New Features section (though it does not 
qualify as an Improvement, a Bug fix, or any of that). I will move it to the 
bug fix section.

bq. Do you still need HDFS-4296 once HDFS-4434 is reverted?
It is needed because it corresponds to a layout version reserved in branch-1 
for concat. It is not related to HDFS-4434.



[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-04-25 Thread Konstantin Shvachko (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13642118#comment-13642118
 ] 

Konstantin Shvachko commented on HDFS-4489:
---

Posted a request for the basis for porting this to branch 2.0.5 in HDFS-4434.

> Suresh: What is the concern?

My concern is that you committed an incompatible change, which is a new feature 
and a large change, into the stabilization branch without a vote or a release 
plan discussed with the community.
Besides being a bad practice in general, I think it is a wrong move now in 
particular, because people are discussing the stabilization of 2.0.5.
This feature totals about 150K of code in patches (counting subtasks only). That 
does not help stabilization. And you didn't give any reasons for the merge.

I would like to ask you to revert this merge from branch 2.0.5 and follow the 
procedures for merging features into new release branches if you decide to 
proceed.


[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-04-25 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13642185#comment-13642185
 ] 

Suresh Srinivas commented on HDFS-4489:
---

bq. My concern is that you committed incompatible change
Konstantin, I am not sure if you looked at the release notes. This change 
disallows a file or directory named .reserved under root. That is the only 
reason why I marked it as incompatible. It is not related to wire or API 
incompatibility. That said, one of the goals for 2.0.5 is to drive towards a 
state where incompatible changes are not allowed after it.
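
To illustrate the kind of restriction described above, here is a minimal sketch of a check that treats .reserved directly under root as off limits. This is not the NameNode code from the patch; the class name, the method name, and the string-splitting approach are all assumptions made for illustration.

```java
// Hypothetical check, not the actual NameNode code: flags paths whose first
// component under root is ".reserved".
final class ReservedNameSketch {
    static final String RESERVED = ".reserved";

    /** Returns true if the absolute path is "/.reserved" or lies beneath it. */
    static boolean isUnderReservedRoot(String absolutePath) {
        if (absolutePath == null || !absolutePath.startsWith("/")) {
            throw new IllegalArgumentException("expected an absolute path");
        }
        // "/a/b".split("/") yields ["", "a", "b"]; index 1 is the first component.
        // "/" alone yields an empty array, so the length check covers root.
        String[] parts = absolutePath.split("/");
        return parts.length > 1 && RESERVED.equals(parts[1]);
    }
}
```

Under these assumptions, /.reserved and /.reserved/x are flagged, while /user/.reserved is not, since the name is only restricted directly under root.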

bq. which is a new feature and a large change, into the stabilization branch 
without a vote or a release plan discussed with the community.
I agree that this is a new feature. Committers routinely promote changes that 
they consider okay to branch-2. I believe this does not add to the 
instability. Let me know if you disagree based on a code review/testing.

Also, merging to branch-2 is in a lot of cases done based on a committer's 
judgement. Please look at the various other jiras that were merged into 
branch-2 without a vote thread. I do not consider this a large feature. 
However, for the Snapshot feature, I would have brought that up in a release 
thread.

bq. And you didn't give any reasons for the merge.
I think there is enough motivation for the feature posted in the jira. 

bq. I would like to ask to revert this merge from branch 2.0.5 and follow the 
procedures for merging features into new release branches if you decide to 
proceed.
I have spent more than 12 hours merging the chain of jiras required and 
resolving conflicts before getting to the 4 changes that introduced file id. Is 
your concern about HDFS-4434 or all the related changes?


[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-04-25 Thread Nathan Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13642212#comment-13642212
 ] 

Nathan Roberts commented on HDFS-4489:
--

Sorry this is a really late comment but I'd really like to see some performance 
numbers before and after. While 6.5% increase in overall heap size is not 
massive, my main concern is the 25% increase in a very core data structure 
within the NN (from 1.07G to 1.37G in Todd's measurement of INodeFile). This could 
cause significant cache pollution and therefore could have a very measurable 
impact on performance. I don't know for sure that it will, but it seems it 
would be reasonable to verify.




[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-04-25 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13642225#comment-13642225
 ] 

Suresh Srinivas commented on HDFS-4489:
---

[~nroberts] What performance test would you like to be run with and without 
this change?


[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-04-25 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13642242#comment-13642242
 ] 

Suresh Srinivas commented on HDFS-4489:
---

I have reverted HDFS-4434 from branch-2. Will post the performance numbers and 
then commit the change to branch-2, based on that discussion.


[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-04-25 Thread Sanjay Radia (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13642290#comment-13642290
 ] 

Sanjay Radia commented on HDFS-4489:


Nathan. A question.
Suresh is willing to do the performance benchmark, but I am trying to 
understand where you are coming from.  Yahoo and FB create very large 
namespaces by simply buying more memory and increasing the size of the heap. Do 
you worry about cache pollution when you create 50K more files? Given that the 
NN heap (many GBs) is so much larger than the cache, does the additional inode 
and inode-map size impact the overall system performance? Suresh has argued 
that a 24GB heap grows by 625MB. Looking at the growth in memory of this 
feature as a percentage of the total heap size is a more realistic way of 
looking at the impact of the growth than the growth of an individual data 
structure like the inode.

IMHO, not having an inode-map and inode number was a serious limitation in the 
original implementation of NN. I am willing to pay for the extra memory given 
the value inode-id and inode-map brings (as described by suresh in the 
beginning of this Jira). Permissions, access time, etc added to the memory 
cost of the NN and were accepted because of the value they bring. 





[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-04-25 Thread Konstantin Shvachko (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13642355#comment-13642355
 ] 

Konstantin Shvachko commented on HDFS-4489:
---

Suresh, whatever the reason for incompatibility, it should go through the 
approval process.
You also committed the LayoutVersion change HDFS-4296. Now it requires an 
upgrade.

> one of the goals for 2.0.5 is to drive towards a state where incompatible 
> changes are not allowed after it.

That was the goal for Hadoop 0.20.
I thought the goal for 2.0.5 is stabilization.

> Also merging to branch-2 in a lot of cases is done based on a committer's 
> judgement.

I think it is wrong. Especially for the stabilization release.

> I think there is enough motivation for the feature posted in the jira. 

Not arguing about the value of the feature, but about its necessity for 2.0.5.

> Is your concern about HDFS-4434 or all the related changes?

Most of them. I would have reviewed if I had a proper warning.
So again, why is it necessary for 2.0.5?


[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-04-25 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13642455#comment-13642455
 ] 

Suresh Srinivas commented on HDFS-4489:
---

{quote}
That was the goal for Hadoop 0.20.
I thought the goal for 2.0.5 is stabilization.
{quote}
I am not sure if 0.20 is a typo. If it is not, I have a hard time parsing that 
statement. See the previous discussion about 2.0.4-beta (now called 2.0.5) in 
this thread:
http://hadoop.markmail.org/thread/v44nqp466p76jpkj


bq. I think it is wrong. Especially for the stabilization release.
I disagree. I want to get some of the features I have been working on into this 
release. I think the goal of this release is to get API and wire compatibility 
stable.

bq. Most of them. I would have reviewed if I had a proper warning.
I am not sure what kind of warning you are talking about. HDFS-4434 has been in 
development for a long time with more than 32 iterations of the patch.

bq. So again why is it necessary for 2.0.5?
The Snapshot and NFS features depend on this. I would like to see it become 
available in 2.0.5.



[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-04-24 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13640438#comment-13640438
 ] 

Suresh Srinivas commented on HDFS-4489:
---

I am planning to push the subtasks of this jira to release 2.0.


[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-04-16 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13633360#comment-13633360
 ] 

Suresh Srinivas commented on HDFS-4489:
---

For people who are following this jira, HDFS-4434 is now ready for review and 
commit. Please provide any feedback you have soon; otherwise, comments that 
come late will have to be incorporated in a subsequent jira.


[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-04-11 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13628931#comment-13628931
 ] 

Daryn Sharp commented on HDFS-4489:
---

bq. {quote}Perhaps ASN.1 encoding the long for the inode id will significantly 
decrease the memory consumption?{quote}
bq. Can you add more details on how this would decrease memory consumption?

If the long is encoded as a variable length byte array, it should take a long 
time to exceed 4-5 bytes.  With minimal effort and complexity, the memory 
increase would nominally be cut in half for many deployments.  Just a 
suggestion.
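
A minimal sketch of the variable-length idea, assuming a simple base-128 scheme (7 payload bits per byte) rather than actual ASN.1 encoding, and assuming ids are non-negative. The class and method names are hypothetical, and nothing here is taken from the HDFS code base.

```java
// Illustrative only: base-128 variable-length encoding of a non-negative long.
// Not ASN.1 and not the HDFS implementation; the class name is hypothetical.
final class VarLongSketch {
    /** Encodes a non-negative value, 7 payload bits per byte, low-order group first. */
    static byte[] encode(long value) {
        if (value < 0) {
            throw new IllegalArgumentException("negative ids are not handled in this sketch");
        }
        byte[] buf = new byte[9]; // 63 bits at 7 bits per byte needs at most 9 bytes
        int n = 0;
        while (value >= 0x80) {
            buf[n++] = (byte) ((value & 0x7F) | 0x80); // high bit set: more bytes follow
            value >>>= 7;
        }
        buf[n++] = (byte) value; // final byte, high bit clear
        return java.util.Arrays.copyOf(buf, n);
    }

    /** Decodes a byte array produced by encode. */
    static long decode(byte[] bytes) {
        long result = 0;
        int shift = 0;
        for (byte b : bytes) {
            result |= (long) (b & 0x7F) << shift;
            shift += 7;
        }
        return result;
    }
}
```

With this scheme, ids below 2^28 fit in at most 4 bytes and ids below 2^35 in at most 5. Note that in Java a byte[] is itself an object with its own header, so whether this saves heap compared with a primitive long field would need to be measured.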


[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-04-11 Thread Arpit Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13629197#comment-13629197
 ] 

Arpit Agarwal commented on HDFS-4489:
-

{quote}
If the long is encoded as a variable length byte array, it should take a long 
time to exceed 4-5 bytes. With minimal effort and complexity, the memory increase 
would nominally be cut in half for many deployments.
{quote}
This would save space when serializing the fsImage. I am not sure if we can 
reduce in-memory usage below the size of a primitive long since the byte array 
is an object.


[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-04-10 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13627920#comment-13627920
 ] 

Daryn Sharp commented on HDFS-4489:
---

Perhaps ASN.1 encoding the long for the inode id will significantly decrease 
the memory consumption?



[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-04-10 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13628050#comment-13628050
 ] 

Suresh Srinivas commented on HDFS-4489:
---

bq. Perhaps ASN.1 encoding the long for the inode id will significantly 
decrease the memory consumption?
Can you add more details on how this would decrease memory consumption? BTW 
inodeID was added as a part of HDFS-4334. See the discussion about how to reduce 
the impact of adding inode ID - 
https://issues.apache.org/jira/browse/HDFS-4258?focusedCommentId=13508432&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13508432.

But I am not sure if that optimization is necessary at the expense of code 
complexity. Thoughts?



[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-04-10 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13628057#comment-13628057
 ] 

Kihwal Lee commented on HDFS-4489:
--

bq. With this change, it is expected that NN is allocated more memory, say 5%. 
If this is done I am not sure why users should be told namespace limit is X% 
worse?

In many use cases, allocating more heap may not be a problem since machines 
typically have more memory available. But if you approach from the view point 
of owners of existing hardware that was spec'ed to hold certain size of 
namespace, it can be viewed as a decrease of capacity. I am not saying it is a 
showstopper. I just felt it should be given more thought.  

I will review the implementation and try to understand your concerns about more 
memory efficient design.



[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-04-10 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13628197#comment-13628197
 ] 

Suresh Srinivas commented on HDFS-4489:
---

bq. But if you approach from the view point of owners of existing hardware that 
was spec'ed to hold certain size of namespace, it can be viewed as a decrease 
of capacity.
Again I do not believe anyone runs with NN very tightly configured given the 
nature of garbage collection. That said, to make further progress, the following 
optimizations can be done:

# Initialize the map only when this feature is enabled. Should take away 
roughly 1/3 of extra memory.
# Reuse existing bits in INodeId - 
https://issues.apache.org/jira/secure/EditComment!default.jspa?id=12618468&commentId=13508432.
 Should take away roughly 1/3 of extra memory.
# Use first block ID of the file (after ensuring even an empty file has an 
associated block) as the InodeID. This is very ugly and mixes two abstractions 
that should not be mixed. I am reluctant to make this optimization.

My vote is to keep the code simple and the abstractions clean. If folks think the 
above optimizations are worth pursuing, I will update the patch.



[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-04-10 Thread Brandon Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13628217#comment-13628217
 ] 

Brandon Li commented on HDFS-4489:
--

{quote}I am not saying it is a showstopper. I just felt it should be given more 
thought. {quote}
In many cases, a trade-off is involved with the introduction of a new feature 
or enhancement. 
This JIRA was forked from HDFS-4258 and the discussion/experiment has been 
going on for more than 4 months. 

As shown in the theoretical analysis and experiment results, the memory overhead of 
this change is not significant. It doesn't seem to be worthwhile for now to 
complicate NameNode code to do the extra optimizations. 



[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-04-09 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13626799#comment-13626799
 ] 

Daryn Sharp commented on HDFS-4489:
---

Maybe something simple like GridMix to get a rough feel for the overhead of the 
extra resolution.  I don't expect it to be much, but it'd be nice to know.



[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-04-09 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13626868#comment-13626868
 ] 

Kihwal Lee commented on HDFS-4489:
--

bq. Please look at the overall increase in memory usage instead of increase 
over used memory. 
Your point would be valid only if the overhead was entirely a fixed amount 
(e.g. GSet).  Since the extra memory consumption increases as the size of 
namespace grows, factoring the arbitrary max heap size into this can be 
misleading.  But I agree that the 9% figure does not have an absolute meaning 
either. If the inode-to-block ratio is different, the number will be different. 
For the clusters I have seen, it will be a lower number. The GSet used for 
InodeID to INode map is also semi-fixed. Is it allocated similarly to 
BlocksMap? 

In any case, I would not call this insignificant. We have a namenode which will 
not work well if we upgrade to a release with this feature since it will need 
extra 4-6GB for the steady-state operation. Even if it could absorb the extra 
memory requirement, we would have to tell users that the namespace limit is X% 
worse.  

Simply saying the overhead is insignificant won't convince users. We should 
explain why the benefit from having this feature justifies the overhead.  I 
don't think an on/off switch is necessary. 



[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-04-09 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13626881#comment-13626881
 ] 

Suresh Srinivas commented on HDFS-4489:
---

bq. The GSet used for InodeID to INode map is also semi-fixed. Is it allocated 
similarly to BlocksMap?
Yes. Please see the patch in HDFS-4434. About 1% of heap is used for the GSet.

bq. Simply saying the overhead is insignificant won't convince users. We should 
explain why the benefit from having this feature justifies the overhead. I 
don't think on/off switch is necessary.
I think the assertion here is not that the overhead is insignificant. Depending on 
details of how the namespace of a system is laid out, I would think this would 
be anywhere from 2 to 5%.

As for the benefits, in the main description I laid this out:

---
This helps in several use cases:
# HDFS can evolve to support ID based protocols such as NFS. We plan to add an 
experimental NFS V3 gateway to HDFS using this mechanism. Will post a github 
link soon.
# InodeID can be used by the tools to track a single instance of a file, for 
caching data or tracking and checking for modification based on INodeID, in 
tools like distcp.
# Path cannot identify a unique instance of a file. This causes issues as 
described in HDFS-4258 and HDFS-4437. It has also been a requirement of many 
other jiras such as HDFS-385.
# Using InodeID as an identifier instead of path can be more efficient than 
path-based accesses.
---

bq. We have a namenode which will not work well if we upgrade to a release with 
this feature since it will need extra 4-6GB for the steady-state operation. 
Even if it could absorb the extra memory requirement, we would have to tell 
users that the namespace limit is X% worse.
Is this because namenode does not have RAM? With this change, it is expected 
that NN is allocated more memory, say 5%. If this is done I am not sure why 
users should be told namespace limit is X% worse?

My rationale, repeating what I said earlier, is that machines are becoming 
available with more RAM. Adding 5% JVM heap should not be a problem. In fact 
most of the namenodes are configured with enough headroom already and might 
not even need a change. But if this is a big concern, I am okay making 
additional changes to bring the extra memory consumption down close to zero. 





[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-04-08 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13625918#comment-13625918
 ] 

Daryn Sharp commented on HDFS-4489:
---

I've only skimmed this jira, but a 9% increase is fairly substantial for large 
namespaces.  Are there any performance metrics available?



[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-04-08 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13625969#comment-13625969
 ] 

Suresh Srinivas commented on HDFS-4489:
---

bq. 9% increase is fairly substantial for large namespaces.
Please look at the overall increase in memory usage instead of the increase over 
used memory. As I said, that is close to 2.6%.

bq. Are there any performance metrics available?
I do not see much concern here. In fact I removed the flag to turn this feature 
on or off. If you think based on the code this is a concern, I could add the 
flag back. What metrics would you like to see?



[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-03-29 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13617328#comment-13617328
 ] 

Suresh Srinivas commented on HDFS-4489:
---

Any further comments? I plan to wrap up HDFS-4334 soon. If there are no further 
concerns, I do not plan on optimizing memory further at the expense of code 
complexity.



[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-03-28 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13616634#comment-13616634
 ] 

Todd Lipcon commented on HDFS-4489:
---

Here's the results from the latest patch:

h2. Setup
- Java 6u31 configured with a 24GB heap (-Xms24g -Xmx24g)
- fsimage is 4.1GB on disk, snapshot from a mid size production cluster which 
runs both hbase and some MR workloads.
- 31249022 files and directories, 26525575 blocks = 57774597 total filesystem 
objects.

In each test, I started the NameNode, waited until it had loaded the image and 
opened its IPC port, and then used jmap -histo:live, which issues a full GC 
and reports heap usage statistics.

h2. 2.0.3-beta release
Total heap: 7069MB

Top consumers
{code}
 num     #instances         #bytes  class name
----------------------------------------------
   1:      38421509     2049194112  [Ljava.lang.Object;
   2:      26525179     1485410024  org.apache.hadoop.hdfs.server.blockmanagement.BlockInfo
   3:      19134601     1071537656  org.apache.hadoop.hdfs.server.namenode.INodeFile
   4:      16228949      753517120  [B
   5:      12113580      581451840  org.apache.hadoop.hdfs.server.namenode.INodeDirectory
   6:      19135442      484175352  [Lorg.apache.hadoop.hdfs.server.blockmanagement.BlockInfo;
   7:       1621399      403948560  [I
   8:      11895039      285480936  java.util.ArrayList
   9:             1      268435472  [Lorg.apache.hadoop.hdfs.util.LightWeightGSet$LinkedElement;
{code}

h2. Patched trunk with the map turned off
Total heap: 7528MB (6.5% increase from 2.0.3)

Top consumers
{code}
 num     #instances         #bytes  class name
----------------------------------------------
   1:      38421427     2049187584  [Ljava.lang.Object;
   2:      26525179     1485410024  org.apache.hadoop.hdfs.server.blockmanagement.BlockInfo
   3:      19134601     1377691272  org.apache.hadoop.hdfs.server.namenode.INodeFile
   4:      12113580      775269120  org.apache.hadoop.hdfs.server.namenode.INodeDirectory
   5:      16228690      753509864  [B
   6:      19135442      484175352  [Lorg.apache.hadoop.hdfs.server.blockmanagement.BlockInfo;
   7:       1654298      384726200  [I
   8:      11895040      285480960  java.util.ArrayList
   9:             1      268435472  [Lorg.apache.hadoop.hdfs.util.LightWeightGSet$LinkedElement;
{code}

h2. Patched trunk with the map turned on
Total heap: 7696MB (8.9% increase from 2.0.3)

Top consumers
{code}
 num     #instances         #bytes  class name
----------------------------------------------
   1:      38421429     2049187632  [Ljava.lang.Object;
   2:      26525179     1485410024  org.apache.hadoop.hdfs.server.blockmanagement.BlockInfo
   3:      19134601     1377691272  org.apache.hadoop.hdfs.server.namenode.INodeFile
   4:      12113580      775269120  org.apache.hadoop.hdfs.server.namenode.INodeDirectory
   5:      16228746      753515976  [B
   6:      19135442      484175352  [Lorg.apache.hadoop.hdfs.server.blockmanagement.BlockInfo;
   7:       1499494      426158720  [I
   8:             2      402653216  [Lorg.apache.hadoop.hdfs.util.LightWeightGSet$LinkedElement;
   9:      11895040      285480960  java.util.ArrayList
{code}


I don't think this increased memory is necessarily unacceptable, I just wanted 
to see true measurement of the overhead instead of hypotheses. It looks like 
the increased memory cost is about twice what was estimated above.
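
The histograms above also allow a per-object view of where the growth comes from. The sketch below copies the instance and byte counts for three classes from the 2.0.3-beta run and the patched run with the map turned on, then divides them out. The dictionary names and helper functions are illustrative only, and the short class names stand in for the fully qualified ones in the jmap output.

```python
# (instances, total bytes) copied from the jmap -histo:live output above.
BASELINE = {
    "INodeFile": (19134601, 1071537656),
    "INodeDirectory": (12113580, 581451840),
    "LinkedElement[]": (1, 268435472),
}
PATCHED_MAP_ON = {
    "INodeFile": (19134601, 1377691272),
    "INodeDirectory": (12113580, 775269120),
    "LinkedElement[]": (2, 402653216),
}


def bytes_per_instance(row):
    """Average shallow size of one instance of a class."""
    instances, total_bytes = row
    return total_bytes / instances


def growth_bytes(name):
    """Increase in total bytes for a class between the two runs."""
    return PATCHED_MAP_ON[name][1] - BASELINE[name][1]


for name in ("INodeFile", "INodeDirectory"):
    before = bytes_per_instance(BASELINE[name])
    after = bytes_per_instance(PATCHED_MAP_ON[name])
    print(f"{name}: {before:.0f} -> {after:.0f} bytes per instance")

inode_growth = growth_bytes("INodeFile") + growth_bytes("INodeDirectory")
map_growth = growth_bytes("LinkedElement[]")
print(f"inode objects: +{inode_growth / 2**20:.0f} MiB")
print(f"GSet arrays:   +{map_growth / 2**20:.0f} MiB")
```

On these figures each inode object grows by 16 bytes, and the second LightWeightGSet array adds roughly 128MiB. Together those account for most of the difference between the 7069MB and 7696MB totals; the remainder is spread over the other classes in the histogram.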



[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-03-28 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13616645#comment-13616645
 ] 

Suresh Srinivas commented on HDFS-4489:
---

[~tlipcon] Thanks for running the tests.

I personally am not concerned about this increased memory. If there are others 
with concerns, I can try reducing memory consumption further at the expense of 
more complex code. Thoughts?



[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-03-28 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13616652#comment-13616652
 ] 

Suresh Srinivas commented on HDFS-4489:
---

BTW my calculations of increased memory are against the total java heap 
allocated to the process rather than memory used in old generation alone. That is a 
better way to quantify the impact on users, right?



[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-03-28 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13616658#comment-13616658
 ] 

Suresh Srinivas commented on HDFS-4489:
---

bq. BTW my calculations of increased memory are against the total java heap 
allocated to the process rather than memory used in old generation alone. That is a 
better way to quantify the impact on users, right?

Sorry, my previous comments may not be clear to everyone. The increase of 627MB 
from 7069MB to 7696MB is 8.9%; the way I was quantifying it was as a percentage of 
the entire java heap. That is 627MB out of 24GB, which is 2.6%.
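
The two percentages quoted in this thread come from the same increase divided by different denominators. A minimal sketch of that arithmetic, using the totals reported in the measurements and assuming the 24GB heap is 24 * 1024 MB (variable names are illustrative):

```python
def percent(part_mb: float, whole_mb: float) -> float:
    """Express part_mb as a percentage of whole_mb."""
    return 100.0 * part_mb / whole_mb


baseline_used_mb = 7069      # 2.0.3-beta, live heap after full GC
patched_used_mb = 7696       # patched trunk with the map turned on
max_heap_mb = 24 * 1024      # -Xmx24g, assumed to mean 24 * 1024 MB

increase_mb = patched_used_mb - baseline_used_mb

print(f"increase: {increase_mb} MB")
print(f"relative to used heap: {percent(increase_mb, baseline_used_mb):.1f}%")
print(f"relative to max heap:  {percent(increase_mb, max_heap_mb):.1f}%")
```

The first figure grows with the namespace, while the second depends on how much headroom the operator configured, which is the point of disagreement in the surrounding comments.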



[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-03-27 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13615575#comment-13615575
 ] 

Todd Lipcon commented on HDFS-4489:
---

bq. byte[] name - I assume typically ~56 bytes for this. That is (16 bytes 
object overhead, 8 byte length + bytes that make up file name, say 32)

According to your comment here: 
https://issues.apache.org/jira/browse/HDFS-1110?focusedCommentId=12861548&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12861548
 a typical image with ~50M files will only need ~5M unique name byte[] objects, 
so I think it's unfair to count the above against the inode.

I think you're also adding an extra 8 bytes on the arrays -- the array length, 
as I understand it, is a field within the 16-byte object header (occupying the 
second half of the klassId field).

Regardless, this seems like something that's very easy to test rather than try 
to solve analytically. Do you have results for the additional memory overhead 
of this map on a large production image? If it's truly 3-5%, that seems 
reasonable, but I'm afraid it may look closer to 10+% in practice.



[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-03-27 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13615665#comment-13615665
 ] 

Suresh Srinivas commented on HDFS-4489:
---

bq. I think you're also adding an extra 8 bytes on the arrays – the array 
length as I understand it is a field within the 16byte object header (occupying 
the second half of the klassId field).
If you have an authoritative source, please send it to me. I do not understand 
how a 16-byte object header could have a spare 8 bytes to track the array 
length. Some of my previous instrumentation had led me to conclude that the 
array length is 4 bytes on a 32-bit JVM and 8 bytes on a 64-bit JVM. See the 
discussion here - 
http://www.javamex.com/tutorials/memory/object_memory_usage.shtml.

bq. a typical image with ~50M files will only need ~5M unique name byte[] 
objects, so I think it's unfair to count the above against the inode.
That is a fair point. But my assumption that inodes occupy 1/3rd of the Java 
heap is also an approximation, and in practice I would think inodes occupy 
less than that.

I would like to run an experiment on a large production image. But I do not 
have ready access to it and will have to spend time getting to it. Do you have 
any?

bq. but I'm afraid it may look closer to 10+% in practice.
I do not think it will be close to 10%, but let's say it is. I do not see much 
of an issue with it. When we did some of the optimizations earlier, we were not 
sure how the JVM would behave if the heap got close to 64G and hence wanted to 
keep the heap size down. But since then many large installations have gone 
beyond that size without any issues. Smaller installations should be able to 
spare, say, 10% extra heap. But if that is not acceptable, here are the 
alternatives I see:
# Add a configuration option to turn this feature off. Not instantiating the 
GSet will reduce the overhead by 1/3rd. This is simple to do.
# Make more optimizations at the expense of code complexity. I would like to 
avoid this. But if it is deemed very important, with some optimizations, we can 
get it close to 0%.




[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-03-27 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13615677#comment-13615677
 ] 

Todd Lipcon commented on HDFS-4489:
---

bq. If you have an authoritative source, please send me that

Sure, from the JDK7 source code hotspot/src/share/vm/oops/arrayOop.hpp:

{code}
// The layout of array Oops is:
//
//  markOop
//  klassOop  // 32 bits if compressed but declared 64 in LP64.
//  length    // shares klass memory or allocated after declared fields.
{code}

Important to note that the length of arrays is 32-bit, since array.length is an 
int rather than a long. So given a 64-bit field for klassId, it can use 32 bits 
for the actual class and 32 bits for the array length.

bq. I would like to run an experiment on a large production image. But I do not 
have ready access to it and will have to spend time getting to it. Do you have 
any?

Yes, I can run the experiment on a large image. Is HDFS-4434's patch ready to 
apply so I can test it?



[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-03-27 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13615709#comment-13615709
 ] 

Suresh Srinivas commented on HDFS-4489:
---

bq. or allocated after declared fields.
Not sure what this means though.

HDFS-4434 patch is ready. Thanks in advance for running the tests.



[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-03-27 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13615749#comment-13615749
 ] 

Todd Lipcon commented on HDFS-4489:
---

  // The _length field is not declared in C++.  It is allocated after the
  // declared nonstatic fields in arrayOopDesc if not compressed, otherwise
  // it occupies the second half of the _klass field in oopDesc.
  static int length_offset_in_bytes() {
    return UseCompressedOops ? klass_gap_offset_in_bytes() :
                               sizeof(arrayOopDesc);
  }

Basically if CompressedOops are on, then klassids are only 32 bits, but there's 
already a 64-bit field for it, so it just uses the latter 4 bytes for the array 
length. Otherwise it's an extra 4 bytes that comes after the standard oop 
header (oopDesc). So, without compressed oops, arrays take 20 bytes base. With 
them (on by default on heaps under 32GB since 6u18 I believe), the array header 
is the same size as normal objects (16 bytes).

Will take a look at loading a big image with that patch now.



[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-03-27 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13615755#comment-13615755
 ] 

Suresh Srinivas commented on HDFS-4489:
---

Ran some quick tests for object sizes, using 
https://github.com/dweiss/java-sizeof (pretty neat stuff!)
{code}
  public static void main(String[] args) {
    // Sizes of a plain object, an empty array, and an array of 100 null references.
    System.out.println(RamUsageEstimator.sizeOf(new Object()));
    System.out.println(RamUsageEstimator.sizeOf(new Object[0]));
    System.out.println(RamUsageEstimator.sizeOf(new Object[100]));
  }
{code}

With compressed oops on I get:
16
16
416

After turning it off:
16
24
824
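
The numbers above are consistent with the header layout discussed earlier in this thread. Here is a minimal sketch of that arithmetic, assuming an 8-byte mark word, a 4- or 8-byte klass field, a 4-byte array length, and 8-byte object alignment. The class and method names are made up for illustration; this is a model, not a measurement.

```java
// Hypothetical model of HotSpot object/array sizes; names are illustrative only.
final class ArrayHeaderModel {

    // Round up to the next multiple of 8 (assumed object alignment).
    static long align8(long n) {
        return (n + 7) & ~7L;
    }

    // Size of a plain java.lang.Object: mark word + klass field, aligned.
    static long objectSize(boolean compressedOops) {
        long mark = 8;
        long klass = compressedOops ? 4 : 8;
        return align8(mark + klass);
    }

    // Size of an Object[] with the given length: header + 4-byte length + references.
    static long refArraySize(int length, boolean compressedOops) {
        long mark = 8;
        long klass = compressedOops ? 4 : 8;
        long lengthField = 4;
        long refSize = compressedOops ? 4 : 8;
        return align8(mark + klass + lengthField + refSize * length);
    }

    public static void main(String[] args) {
        for (boolean compressed : new boolean[] {true, false}) {
            System.out.println("compressed oops = " + compressed);
            System.out.println(objectSize(compressed));
            System.out.println(refArraySize(0, compressed));
            System.out.println(refArraySize(100, compressed));
        }
    }
}
```

Under these assumptions the model gives 16, 16, 416 with compressed oops and 16, 24, 824 without, which matches the measured output.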




[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-03-27 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13615760#comment-13615760
 ] 

Todd Lipcon commented on HDFS-4489:
---

Neat. I'm setting up those tests now... taking a while to clone/build hadoop 
onto the right machine that has enough RAM.



[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-03-26 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13614692#comment-13614692
 ] 

Todd Lipcon commented on HDFS-4489:
---

bq. Inode size is ~180 bytes and this proposal adds 16-24 bytes per Inode.

How is this calculated? I see the following 5 fields:

{code}
  private byte[] name = null;
  private long permission = 0L;
  protected INodeDirectory parent = null;
  protected long modificationTime = 0L;
  protected long accessTime = 0L;
{code}

for a total of 40 bytes on a 64-bit JVM. So, adding 16-24 bytes is a pretty 
substantial new memory use.

I agree with ATM that this should go on a branch since it's fairly invasive. 
Once the branch is working, we can evaluate the benefit of the new feature vs 
the measured cost (both memory and additional CPU to manage this new structure)



[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-03-26 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13614694#comment-13614694
 ] 

Todd Lipcon commented on HDFS-4489:
---

(I guess I should add the subclass fields, in which case INodeFile has another 
two 8-byte fields, plus the associated array object for BlockInfo, etc)... but 
still seems to come in a lot less than 180 bytes.



[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-03-26 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13614721#comment-13614721
 ] 

Suresh Srinivas commented on HDFS-4489:
---

bq. for a total of 40 bytes on a 64-bit JVM. So, adding 16-24 bytes is a pretty 
substantial new memory use.
Here are the things that go into the ~180 bytes:
An INode is an object. It comes with 16 bytes of object header overhead. 
Members include:
# byte[] name - I assume typically ~56 bytes for this. That is (16 bytes object 
overhead, 8 byte length + bytes that make up file name, say 32)
# reference to byte[] name - 8 bytes
# long permission at the cost of 8 bytes.
# parent reference at 8 bytes cost
# modification time at 8 bytes cost
# accessTime at 8 bytes cost

That is roughly ~112 bytes.

Typically most of the INodes are INode files (I will leave the other types of 
inodes as an exercise).
# It has a BlockInfo[]. This is again 16 bytes of object overhead, 8 bytes of 
length, and, say, two blocks in a file with two references, for a cost of 40 
bytes.
# It has a long header that adds another 8 bytes.

Total ~160 bytes. So it is not very far off, and the number I had posted was 
based on what I had calculated a long time back.

That said, 16-24 bytes might seem like a huge percentage (10 to 15%) of the 
INode size. But what is the amount of memory in the NN heap that is allocated 
for inodes? Assuming inodes make up 1/3, blocks make up another 1/3, and the 
remaining 1/3 is for floating garbage, headroom, etc., the net impact on the NN 
heap is 3 to 5%. That is not far off from the analysis posted above.
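
The estimate above can be reproduced with a few lines of arithmetic. This is only a sketch using the per-field figures from this comment (16-byte object header, 8-byte references, a 32-byte file name, two blocks per file, inodes at 1/3 of the heap). The 8-byte array length is the figure assumed in this comment and is disputed elsewhere in the thread. The class and method names are made up for illustration.

```java
// Hypothetical back-of-the-envelope calculator; all figures are the assumptions
// stated in the comment above, not measurements.
final class INodeSizeEstimate {

    static final int OBJECT_HEADER = 16;
    static final int REFERENCE = 8;
    static final int LONG = 8;
    static final int ARRAY_LENGTH = 8; // as assumed in the comment

    // INode: header + name array (header, length, 32 name bytes) + reference to
    // name + permission + parent reference + modificationTime + accessTime.
    static int inodeBase() {
        int nameArray = OBJECT_HEADER + ARRAY_LENGTH + 32;
        return OBJECT_HEADER + nameArray + REFERENCE + LONG + REFERENCE + LONG + LONG;
    }

    // INodeFile: base + BlockInfo[] with two references + long header field.
    static int inodeFile() {
        int blockArray = OBJECT_HEADER + ARRAY_LENGTH + 2 * REFERENCE;
        return inodeBase() + blockArray + LONG;
    }

    // Extra heap, in percent, if each inode grows by addedBytes and inodes
    // make up inodeShare of the heap.
    static double heapImpactPercent(int addedBytes, double inodeShare) {
        return 100.0 * addedBytes / inodeFile() * inodeShare;
    }

    public static void main(String[] args) {
        System.out.println("INode base bytes: " + inodeBase());
        System.out.println("INodeFile bytes:  " + inodeFile());
        System.out.printf("Heap impact for +16 bytes: %.1f%%%n", heapImpactPercent(16, 1.0 / 3));
        System.out.printf("Heap impact for +24 bytes: %.1f%%%n", heapImpactPercent(24, 1.0 / 3));
    }
}
```

With these inputs the base INode comes to 112 bytes and an INodeFile to 160 bytes, so 16-24 extra bytes is 10 to 15% of the inode and roughly 3 to 5% of the heap.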

I believe half of the work is already in trunk. The remaining two jiras need to 
go in. I believe creating a branch at this point in time is unnecessary work.

If you are concerned about the memory usage of your installs, I can add a config 
option to not instantiate the map.






[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-03-26 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13614802#comment-13614802
 ] 

Suresh Srinivas commented on HDFS-4489:
---

One more thing I forgot to add. There are many optimizations that are possible 
to reduce the memory consumed. They come at the cost of code complexity and 
less clean abstractions. Given that newer machines are coming with more memory, 
I would rather accept the additional memory use than make the code unclean.



[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-03-25 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13613236#comment-13613236
 ] 

Suresh Srinivas commented on HDFS-4489:
---

Given that the subtasks do not break the trunk, I plan to start reviewing 
individual jiras and committing the patches attached to subtasks. Some of these 
patches are already committed to trunk.



[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-03-25 Thread Aaron T. Myers (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13613240#comment-13613240
 ] 

Aaron T. Myers commented on HDFS-4489:
--

Why not do this on a branch? That makes the most sense to me, given that the 
individual patches themselves don't make a lot of sense when considered 
individually.



[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-03-25 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13613261#comment-13613261
 ] 

Suresh Srinivas commented on HDFS-4489:
---

[~atm] Can you describe which of the individual patches do not make sense to 
you? I thought the previous comments indicated that it was not clear what the 
overall design is and how the pieces fit together. Now that this jira 
describes the overall motivation and the approach being taken, I hope there is 
more clarity.
 



[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-03-25 Thread Aaron T. Myers (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13613269#comment-13613269
 ] 

Aaron T. Myers commented on HDFS-4489:
--

Sorry if I wasn't clear - all the patches make sense to me, it's just that 
several of them don't really stand on their own, so it seems like we should 
do the whole work on a separate branch, get the feature in shape there, 
and then merge it back to trunk once the whole project is completed.



[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-03-25 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13613293#comment-13613293
 ] 

Suresh Srinivas commented on HDFS-4489:
---

bq. all the patches make sense to me, it's just that several of them don't 
really stand on their own
I am not sure I understand this. One of the main reasons for a feature branch 
(at least for me) is that, during development, we may break trunk. But that is 
not the case here. 

I have cleaned up the list of subtasks in this jira. Hopefully the subtasks 
now make it clearer.

Let me add some details about the individual jiras; that should help in 
understanding them better:
# HDFS-4334 - Adds a unique ID to each INode.
# HDFS-4346 - Refactors the code to remove duplication between INode ID 
generation and block ID generation.
# HDFS-4339 - Persists the INode ID in the fsimage.
# HDFS-4434 - Introduces a map of inode ID to inode so that inodeid/fileid can 
be used as an identifier to address a file.
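
To illustrate what the last item buys, here is a minimal sketch of an ID-to-inode map in which a file keeps its ID across a rename. It is not the HDFS-4434 patch: the thread indicates the real map is a GSet inside the NameNode, while this sketch uses plain `HashMap`s, and every class, field, and method name below is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical toy namespace; a stand-in for the idea, not NameNode code.
final class InodeMapSketch {

    static final class Inode {
        final long id;   // assigned once, never reused
        String path;     // changes on rename
        Inode(long id, String path) {
            this.id = id;
            this.path = path;
        }
    }

    private final Map<Long, Inode> byId = new HashMap<>();
    private final Map<String, Inode> byPath = new HashMap<>();
    private long nextId = 1000; // arbitrary starting value for the sketch

    // Create a file and return its newly assigned inode ID.
    long create(String path) {
        Inode inode = new Inode(nextId++, path);
        byId.put(inode.id, inode);
        byPath.put(path, inode);
        return inode.id;
    }

    // Rename changes the path but leaves the ID untouched.
    void rename(String src, String dst) {
        Inode inode = byPath.remove(src);
        if (inode == null) {
            throw new IllegalArgumentException("no such path: " + src);
        }
        inode.path = dst;
        byPath.put(dst, inode);
    }

    // Address a file by ID; returns null if the ID is unknown.
    String pathOf(long id) {
        Inode inode = byId.get(id);
        return inode == null ? null : inode.path;
    }

    public static void main(String[] args) {
        InodeMapSketch ns = new InodeMapSketch();
        long id = ns.create("/a/f");
        ns.rename("/a/f", "/b/g");
        long replacement = ns.create("/a/f");
        System.out.println(ns.pathOf(id));     // the original file, now at its new path
        System.out.println(id != replacement); // a new file at the old path gets a new ID
    }
}
```

A client holding the ID still reaches the original file after the rename, and a new file created at the old path is distinguishable because it gets a different ID. That is the property items 1 and 2 of the issue description rely on.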




[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-03-25 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13613338#comment-13613338
 ] 

Suresh Srinivas commented on HDFS-4489:
---

bq. At one point in time I believe the INodeID work did indeed break WebHDFS on 
trunk.
That was because of a bug introduced in the change. When I say *break* in my 
previous comment, I mean an incomplete set of changes leaving HDFS 
non-functional, not a bug in the committed code.

bq. I don't understand the resistance to doing this on a feature branch. What's 
the concern with doing so?
I am not resisting it; I just do not see a need for it. I believe we have two 
more jiras to go in. The other jiras are already in. I think moving those 
commits to a separate branch, adding a mere two more commits on that branch, 
and calling for a merge vote is an unnecessary waste of time, and I want to 
avoid it.



[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-03-25 Thread Aaron T. Myers (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13613300#comment-13613300
 ] 

Aaron T. Myers commented on HDFS-4489:
--

bq. One of the main reasons for a feature branch (at least for me), while 
during development, we may break trunk. But in this case that is not the case.

At one point in time I believe the INodeID work did indeed break WebHDFS on 
trunk.

Another reason for using a development branch is that the feature isn't 
necessarily complete until certain patches have been committed. The fact 
that HDFS-4339 (persist INodeIDs in the fsimage) isn't committed yet suggests 
that this feature won't really work as intended until that's committed, yet 
we've already committed other patches involving INodeIDs to trunk. That doesn't 
make much sense to me.

I don't understand the resistance to doing this on a feature branch. What's the 
concern with doing so?



[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-03-25 Thread Aaron T. Myers (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13613343#comment-13613343
 ] 

Aaron T. Myers commented on HDFS-4489:
--

bq. I am not resisting it; I do not see a need for it. I believe we have two 
more jiras to go in. The other jiras are already in. I think moving those 
commits to a separate branch, adding a mere two more commits in that branch, 
and calling for a merge vote is an unnecessary waste of time, and I want to 
avoid it.

Alright, go for it. I'll repeat my claim that this work should have been done 
on a branch to begin with, but c'est la vie.



[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-03-15 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13603642#comment-13603642
 ] 

Suresh Srinivas commented on HDFS-4489:
---

h2. Introduction
This jira proposes to introduce InodeID in HDFS. An InodeID uniquely identifies 
an instance of an inode, such as a directory, file, or symbolic link, 
independent of the file path name. This helps in several use cases:
# HDFS can evolve to support ID-based protocols such as NFS. We plan to add an 
experimental NFS V3 gateway to HDFS using this mechanism. We will post a GitHub 
link soon.
# InodeID can be used by tools to track a single instance of a file, for 
caching data or for tracking and checking for modification based on InodeID, in 
tools like distcp.
# A path cannot identify a unique instance of a file. This causes issues as 
described in HDFS-4258 and HDFS-4437. It has also been a requirement of many 
other jiras such as HDFS-385.
# Using InodeID as an identifier instead of a path can be more efficient than 
path-based access. 

h4. New Inode identifier
It is a 64-bit number with the following properties:
# Number 0 is reserved and is used to identify an invalid/default value.
# Numbers 1-1023 are reserved for some unforeseen future requirements. The 
InodeID starts from 1024.
# An InodeID is never re-used in a single namenode namespace.
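
The three properties above can be sketched as a small allocator. This is a minimal illustration in Java, not the code committed to HDFS: the class name {{InodeIdAllocatorSketch}} and its methods are hypothetical, and it assumes a single in-process monotonic counter is enough to avoid reuse.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch of an InodeID allocator; not the HDFS implementation. */
final class InodeIdAllocatorSketch {
    /** 0 is reserved for the invalid/default value. */
    static final long INVALID_ID = 0L;
    /** 1-1023 are reserved for unforeseen future requirements. */
    static final long LAST_RESERVED_ID = 1023L;

    // Holds the most recently issued ID. Starting at the last reserved
    // value means the first ID handed out is 1024.
    private final AtomicLong lastIssued = new AtomicLong(LAST_RESERVED_ID);

    /** Returns a fresh ID. IDs only ever increase, so none is handed out twice. */
    long nextId() {
        return lastIssued.incrementAndGet();
    }

    /** True if the ID lies outside the invalid and reserved ranges. */
    static boolean isAssignable(long id) {
        return id > LAST_RESERVED_ID;
    }

    public static void main(String[] args) {
        InodeIdAllocatorSketch allocator = new InodeIdAllocatorSketch();
        System.out.println(allocator.nextId());
        System.out.println(allocator.nextId());
    }
}
```

A counter held only in memory guarantees no reuse within one process lifetime. For the guarantee to hold across namenode restarts, the last issued value would also have to be persisted, which this sketch does not attempt.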

h3. General overview of the changes required
Applications discover the InodeID by getting the FileStatus for an Inode or 
when an Inode is created or opened for append. FileStatus will be changed to 
include InodeID. Create and append will be changed to return FileStatus as well.

For other APIs that use path (Path or String) to identify a file we have two 
choices:
# *API Change*: Add another variant of the API that uses InodeID to identify a 
file, or add an additional InodeID parameter to the API.
# *No API Change*: Use the path to send the InodeID and keep the API changes to 
a minimum. For example, one could use the path {{/.inodes/inodesID}}, where 
.inodes is a reserved name identifying a path that carries an InodeID instead 
of a regular path. This is similar to /proc, which is used for special purposes 
on *nix.
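
To illustrate the second choice, here is a minimal sketch in Java of how a reserved path could be recognized and turned back into an InodeID. The class {{InodePathSketch}}, the method {{parseInodeId}}, and the decision to accept only a bare decimal ID after the prefix are assumptions made for this example; the proposal itself fixes only the {{/.inodes/}} naming idea.

```java
import java.util.OptionalLong;

/** Hypothetical sketch: recognize a reserved /.inodes/ path and extract the ID. */
final class InodePathSketch {
    /** Reserved prefix taken from the proposal's example path. */
    static final String INODES_PREFIX = "/.inodes/";

    /**
     * Returns the InodeID carried by the path, or empty if the path is a
     * regular path or does not carry a valid decimal ID.
     */
    static OptionalLong parseInodeId(String path) {
        if (path == null || !path.startsWith(INODES_PREFIX)) {
            return OptionalLong.empty();
        }
        String rest = path.substring(INODES_PREFIX.length());
        if (rest.isEmpty()) {
            return OptionalLong.empty();
        }
        // Accept decimal digits only; this also rejects signs and sub-paths.
        for (int i = 0; i < rest.length(); i++) {
            char c = rest.charAt(i);
            if (c < '0' || c > '9') {
                return OptionalLong.empty();
            }
        }
        try {
            long id = Long.parseLong(rest);
            // 0 is the reserved invalid/default value.
            return id == 0L ? OptionalLong.empty() : OptionalLong.of(id);
        } catch (NumberFormatException overflow) {
            // Digit string too large for a 64-bit value.
            return OptionalLong.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(parseInodeId("/.inodes/1024"));
        System.out.println(parseInodeId("/user/alice/data.txt"));
    }
}
```

The appeal of this option is that existing path-based calls keep their signatures; only path resolution needs to learn about the reserved name.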

h4. InodeID to Inode map
A new map (based on GSet) will be introduced in the namenode to map a given 
InodeID to an Inode. This is in addition to existing map of BlockID to 
BlockInfo.
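
As a rough illustration of what this map provides, the Java sketch below uses {{java.util.HashMap}} as a stand-in for GSet and a hypothetical, stripped-down {{Inode}} class. It only shows that an ID lookup keeps working after a rename; it says nothing about the memory layout discussed next.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of an InodeID-to-Inode map; HashMap stands in for GSet. */
final class InodeMapSketch {
    /** Stripped-down inode: a fixed ID plus a path that renames may change. */
    static final class Inode {
        final long id;
        String path;

        Inode(long id, String path) {
            this.id = id;
            this.path = path;
        }
    }

    private final Map<Long, Inode> byId = new HashMap<>();

    void add(Inode inode) {
        byId.put(inode.id, inode);
    }

    /** Returns the inode for the ID, or null if no such inode exists. */
    Inode get(long id) {
        return byId.get(id);
    }

    void remove(long id) {
        byId.remove(id);
    }

    public static void main(String[] args) {
        InodeMapSketch map = new InodeMapSketch();
        Inode file = new Inode(1024L, "/a/report.txt");
        map.add(file);

        // A rename changes the path but not the ID, so the lookup is unaffected.
        file.path = "/b/report.txt";
        System.out.println(map.get(1024L).path);
    }
}
```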

h4. Additional memory consumption
Adding all this will require additional memory in the namenode.

* Adding an 8-byte InodeID to the Inode object costs 8 bytes per Inode. As 
proposed in HDFS-4258, this can be folded into the existing modification and 
access time.
* Introducing the InodeID to Inode GSet results in additional memory of 16 
bytes per Inode:
** 8 bytes * size of the GSet (where the size of the GSet could be as large as 
the number of Inodes)
** 8 bytes per Inode for a Java reference (pointer to the next element, as 
required by GSet)

Inodes and related objects consume approximately 1/3 of the JVM heap in a 
system that is full. Inode size is ~180 bytes and this proposal adds 16-24 
bytes per Inode. With this change, the JVM heap needs to be increased by 3% to 
4.5%. Further optimizations could reduce this overhead, but they would add 
needless code complexity.
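
The heap estimate can be reproduced with a few lines of arithmetic. The Java sketch below plugs in the figures quoted in this comment (Inodes at about 1/3 of the heap, about 180 bytes per Inode, 16 to 24 added bytes); the class and method names are made up for illustration, and reading the 24-byte case as the 16-byte GSet cost plus an 8-byte ID field that is not folded into existing fields is an assumption.

```java
import java.util.Locale;

/** Hypothetical sketch: reproduce the heap growth estimate from the proposal. */
final class InodeMemoryEstimateSketch {
    /**
     * Heap growth in percent =
     *   (share of heap used by inodes) * (added bytes / current inode size) * 100.
     */
    static double heapGrowthPercent(double inodeHeapShare,
                                    double inodeBytes,
                                    double addedBytes) {
        return 100.0 * inodeHeapShare * addedBytes / inodeBytes;
    }

    public static void main(String[] args) {
        double share = 1.0 / 3.0;   // inodes take about 1/3 of the heap
        double inodeBytes = 180.0;  // approximate current inode size

        // Lower bound: GSet overhead only. Upper bound: plus an 8-byte ID field.
        for (double added : new double[] {16.0, 24.0}) {
            System.out.println(String.format(Locale.ROOT,
                "+%.0f bytes per inode -> heap grows by about %.1f%%",
                added, heapGrowthPercent(share, inodeBytes, added)));
        }
    }
}
```

With these inputs the two bounds come out close to 3% and a little under 4.5%, in line with the range stated above.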

h4. Security concerns
InodeID is an alternate name, similar to a path. All the existing security 
mechanisms that apply to paths (that is, ensuring permissions are checked from 
the root to the Inode) will also be applied to InodeID-based access.

