subject:"[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs"

[
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13642185#comment-13642185
]

Suresh Srinivas commented on HDFS-4489:
---

bq. My concern is that you committed incompatible change
Konstantin, not sure if you looked at the release notes. This change disallows
a file or directory named .reserved under root. That is the only reason
I marked it as incompatible. It is not related to wire or API
incompatibility. That said, one of the goals for 2.0.5 is to drive towards a
state where incompatible changes are not allowed after it.
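
To make the naming restriction concrete, here is a minimal sketch of such a
check. The class and method names are hypothetical and are not taken from the
HDFS code base; the sketch only assumes, as stated above, that the name
.reserved is rejected directly under root.

```java
/** Hypothetical helper for illustration only; not the HDFS implementation. */
final class ReservedNameCheck {
    static final String DOT_RESERVED = ".reserved";

    /** True if the absolute path names an entry called ".reserved" directly under "/". */
    static boolean isReservedUnderRoot(String path) {
        if (path == null || !path.startsWith("/")) {
            return false;
        }
        // "/.reserved" splits into ["", ".reserved"]; split() drops trailing empty parts.
        String[] parts = path.split("/");
        return parts.length == 2 && DOT_RESERVED.equals(parts[1]);
    }

    public static void main(String[] args) {
        System.out.println(isReservedUnderRoot("/.reserved"));      // rejected name
        System.out.println(isReservedUnderRoot("/user/.reserved")); // allowed below root
    }
}
```

Under this assumption a name such as /user/.reserved stays legal, which is why
the incompatibility is limited to one name at one level of the namespace.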

bq. which is a new feature and a large change, into the stabilization branch
without a vote or a release plan discussed with the community.
I agree that this is a new feature. Committers routinely promote changes that
they consider okay to branch-2. I believe this does not add to the
instability. Let me know if you disagree based on a code review/testing.

Also, merging to branch-2 is in a lot of cases done based on a committer's
judgement. Please look at the various other jiras that were merged into
branch-2 without a vote thread. I do not consider this a large feature.
However, for the Snapshot feature, I would have brought that up in a release
thread.

bq. And you didn't give any reasons for the merge.
I think there is enough motivation for the feature posted in the jira.

bq. I would like to ask to revert this merge from branch 2.0.5 and follow the
procedures for merging features into new release branches if you decide to
proceed.
I have spent more than 12 hours merging the chain of jiras required and
resolving conflicts before getting to the 4 changes that introduced the file
id. Is your concern about HDFS-4434 or all the related changes?

Use InodeID as as an identifier of a file in HDFS protocols and APIs

Key: HDFS-4489
URL: https://issues.apache.org/jira/browse/HDFS-4489
Project: Hadoop HDFS
Issue Type: Bug
Components: namenode
Reporter: Brandon Li
Assignee: Brandon Li
Fix For: 2.0.5-beta

[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-04-25 Thread Nathan Roberts (JIRA)

[
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13642212#comment-13642212
]

Nathan Roberts commented on HDFS-4489:
--

Sorry this is a really late comment but I'd really like to see some performance
numbers before and after. While 6.5% increase in overall heap size is not
massive, my main concern is the 25% increase in a very core data structure
within the NN (1.07G to 1.37G in Todd's measurement of INodeFile). This could
cause significant cache pollution and therefore could have a very measurable
impact on performance. I don't know for sure that it will, but it seems it
would be reasonable to verify.


[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs


[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13642225#comment-13642225
 ] 

Suresh Srinivas commented on HDFS-4489:
---

[~nroberts] What performance test would you like to be run with and without
this change?

 Use InodeID as as an identifier of a file in HDFS protocols and APIs
 

 Key: HDFS-4489
 URL: https://issues.apache.org/jira/browse/HDFS-4489
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Reporter: Brandon Li
Assignee: Brandon Li
 Fix For: 2.0.5-beta


 The benefits of using InodeID to uniquely identify a file are manifold.
 Here are a few of them:
 1. uniquely identify a file across renames; related JIRAs include HDFS-4258
 and HDFS-4437.
 2. modification checks in tools like distcp. Since a file could have been
 replaced or renamed to, the file name and size combination is not reliable,
 but the combination of file id and size is unique.
 3. id-based protocol support (e.g., NFS)
 4. to make the pluggable block placement policy use fileid instead of
 filename (HDFS-385).
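
 Point 2 can be sketched as follows. This is only an illustration under
 stated assumptions: FileIdentity and CopyCheck are hypothetical names, not
 part of distcp or of any HDFS API, and the id values are made up.

```java
import java.util.Objects;

/** Hypothetical value type: one instance of a file, identified by inode id and length. */
final class FileIdentity {
    final long fileId;   // assumed to stay the same when the file is renamed
    final long length;

    FileIdentity(long fileId, long length) {
        this.fileId = fileId;
        this.length = length;
    }

    @Override
    public boolean equals(Object other) {
        if (!(other instanceof FileIdentity)) {
            return false;
        }
        FileIdentity that = (FileIdentity) other;
        return fileId == that.fileId && length == that.length;
    }

    @Override
    public int hashCode() {
        return Objects.hash(fileId, length);
    }
}

/** Hypothetical modification check of the kind a copy tool could perform. */
final class CopyCheck {
    /** True when the source differs from what was recorded at the last copy. */
    static boolean needsCopy(FileIdentity source, FileIdentity lastCopied) {
        return lastCopied == null || !source.equals(lastCopied);
    }

    public static void main(String[] args) {
        FileIdentity copied = new FileIdentity(16390L, 4096L);
        // Same path and same size, but a different file was renamed over it.
        FileIdentity replaced = new FileIdentity(16421L, 4096L);
        System.out.println(needsCopy(copied, copied));   // false
        System.out.println(needsCopy(replaced, copied)); // true
    }
}
```

 A check keyed on path and size would treat the second case as unchanged,
 which is the unreliability the description refers to.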

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs


[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13642242#comment-13642242
 ] 

Suresh Srinivas commented on HDFS-4489:
---

I have reverted HDFS-4434 from branch-2. Will post the performance numbers and 
then commit the change to branch-2, based on that discussion.


[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-04-25 Thread Sanjay Radia (JIRA)

[
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13642290#comment-13642290
]

Sanjay Radia commented on HDFS-4489:

Nathan. A question.
Suresh is willing to do the performance benchmark, but I am trying to
understand where you are coming from. Yahoo and FB create very large
namespaces by simply buying more memory and increasing the size of the heap. Do
you worry about cache pollution when you create 50K more files? Given that the
NN heap (many GBs) is so much larger than the cache, does the additional inode
and inode-map size impact the overall system performance? Suresh has argued
that a 24GB heap grows by 625MB. Looking at the growth in memory of this
feature as a percentage of the total heap size is a more realistic way of
looking at the impact of the growth than the growth of an individual data
structure like the inode.
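
The two ways of expressing the growth can be worked through with the figures
quoted in this thread (a 24GB heap growing by 625MB, and Todd's measured live
heap of 7069MB before and 7528MB after). The sketch below is only arithmetic;
HeapGrowth is a hypothetical name, and which base is the right one to divide
by is exactly what is being debated here.

```java
import java.util.Locale;

/** Hypothetical helper: expresses a memory increase relative to a chosen base. */
final class HeapGrowth {
    static double percentOf(double extraMb, double baseMb) {
        return 100.0 * extraMb / baseMb;
    }

    public static void main(String[] args) {
        double maxHeapMb = 24 * 1024;   // 24GB configured heap
        double growthMb = 625;          // growth quoted for that heap
        double usedBeforeMb = 7069;     // live heap, 2.0.3-beta measurement
        double usedAfterMb = 7528;      // live heap, patched trunk measurement

        // Growth relative to the configured heap.
        System.out.println(String.format(Locale.ROOT, "%.1f%%",
                percentOf(growthMb, maxHeapMb)));
        // Measured growth relative to the live heap before the change.
        System.out.println(String.format(Locale.ROOT, "%.1f%%",
                percentOf(usedAfterMb - usedBeforeMb, usedBeforeMb)));
    }
}
```

The same absolute increase looks several times larger when divided by the live
heap than when divided by the configured maximum.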

IMHO, not having an inode-map and inode number was a serious limitation in the
original implementation of NN. I am willing to pay for the extra memory given
the value inode-id and inode-map bring (as described by Suresh in the
beginning of this Jira). Permissions, access time, etc. added to the memory
cost of the NN and were accepted because of the value they bring.


[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-04-25 Thread Konstantin Shvachko (JIRA)

[
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13642355#comment-13642355
]

Konstantin Shvachko commented on HDFS-4489:
---

Suresh, whatever the reason for the incompatibility, it should go through the
approval process.
You also committed the LayoutVersion change HDFS-4296. Now it requires an
upgrade.

bq. one of the goals for 2.0.5 is to drive towards a state where incompatible
changes are not allowed after it.
That was the goal for Hadoop 0.20.
I thought the goal for 2.0.5 is stabilization.

bq. Also merging to branch-2 in a lot of cases is done based on a committer's
judgement.
I think it is wrong. Especially for the stabilization release.

bq. I think there is enough motivation for the feature posted in the jira.
Not arguing about the value of the feature. But about its necessity for 2.0.5.

bq. Is your concern about HDFS-4434 or all the related changes?
Most of them. I would have reviewed if I had a proper warning.
So again why is it necessary for 2.0.5?


[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

[
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13642455#comment-13642455
]

Suresh Srinivas commented on HDFS-4489:
---

{quote}
That was the goal for Hadoop 0.20.
I thought the goal for 2.0.5 is stabilization.
{quote}
I am not sure if 0.20 is a typo. If it is not, I have a hard time parsing that
statement. See the previous discussion about 2.0.4-beta (now called 2.0.5) in
this thread:
http://hadoop.markmail.org/thread/v44nqp466p76jpkj

bq. I think it is wrong. Especially for the stabilization release.
I disagree. I want to get some of the features I have been working on into this
release. I think the goal of this release is to get API and wire compatibility
stable.

bq. Most of them. I would have reviewed if I had a proper warning.
I am not sure what kind of warning you are talking about. HDFS-4434 has been in
development for a long time with more than 32 iterations of the patch.

bq. So again why is it necessary for 2.0.5?
The Snapshot and NFS features depend on this. I would like to see it become
available in 2.0.5.


[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-04-24 Thread Suresh Srinivas (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13640438#comment-13640438
 ] 

Suresh Srinivas commented on HDFS-4489:
---

I am planning to push the subtasks of this jira to release 2.0.


[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-04-16 Thread Suresh Srinivas (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13633360#comment-13633360
 ] 

Suresh Srinivas commented on HDFS-4489:
---

For people who are following this jira, HDFS-4434 is now ready for review and
commit. Please provide any feedback you have soon; otherwise, comments that
come late will have to be incorporated in a subsequent jira.


[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-04-11 Thread Daryn Sharp (JIRA)

[
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13628931#comment-13628931
]

Daryn Sharp commented on HDFS-4489:
---

bq. {quote}Perhaps ASN.1 encoding the long for the inode id will significantly
decrease the memory consumption?{quote}
bq. Can you add more details on how this would decrease memory consumption?

If the long is encoded as a variable-length byte array, it should take a long
time to exceed 4-5 bytes. With minimal effort and complexity, the memory
increase would nominally be cut in half for many deployments. Just a
suggestion.
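
As a rough illustration of the size claim, the sketch below encodes a
non-negative long with a base-128 varint (7 payload bits per byte). This is a
stand-in chosen for brevity, not ASN.1 and not code from any patch on this
jira; VarLong and its methods are hypothetical names.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical base-128 varint codec for non-negative ids. */
final class VarLong {
    /** Encodes the value low-order group first; the high bit marks "more bytes follow". */
    static byte[] encode(long value) {
        if (value < 0) {
            throw new IllegalArgumentException("id must be non-negative");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.write((int) value);
        return out.toByteArray();
    }

    /** Reverses encode(); assumes the input was produced by encode(). */
    static long decode(byte[] bytes) {
        long result = 0;
        int shift = 0;
        for (byte b : bytes) {
            result |= (long) (b & 0x7F) << shift;
            shift += 7;
        }
        return result;
    }

    public static void main(String[] args) {
        long[] ids = {1L, 31249022L, (1L << 28) - 1, (1L << 35) - 1};
        for (long id : ids) {
            System.out.println(id + " -> " + encode(id).length + " bytes");
        }
    }
}
```

With this scheme ids below 2^28 fit in 4 bytes and ids below 2^35 fit in 5.
Note that this measures serialized size only; on the JVM a byte[] is itself an
object with its own header, so the in-memory saving is a separate question.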


[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-04-11 Thread Arpit Agarwal (JIRA)

[
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13629197#comment-13629197
]

Arpit Agarwal commented on HDFS-4489:
-

{quote}
If the long is encoded as a variable-length byte array, it should take a long
time to exceed 4-5 bytes. With minimal effort and complexity, the memory
increase would nominally be cut in half for many deployments.
{quote}
This would save space when serializing the fsImage. I am not sure if we can
reduce in-memory usage below the size of a primitive long since the byte array
is an object.


[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-04-10 Thread Daryn Sharp (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13627920#comment-13627920
 ] 

Daryn Sharp commented on HDFS-4489:
---

Perhaps ASN.1 encoding the long for the inode id will significantly decrease 
the memory consumption?


[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-04-10 Thread Suresh Srinivas (JIRA)

[
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13628050#comment-13628050
]

Suresh Srinivas commented on HDFS-4489:
---

bq. Perhaps ASN.1 encoding the long for the inode id will significantly
decrease the memory consumption?
Can you add more details on how this would decrease memory consumption? BTW
inodeID was added as a part of HDFS-4334. See the discussion about how to
reduce the impact of adding inode ID -
https://issues.apache.org/jira/browse/HDFS-4258?focusedCommentId=13508432&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13508432.

But I am not sure if that optimization is necessary at the expense of code
complexity. Thoughts?


[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-04-10 Thread Kihwal Lee (JIRA)

[
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13628057#comment-13628057
]

Kihwal Lee commented on HDFS-4489:
--

bq. With this change, it is expected that NN is allocated more memory, say 5%.
If this is done I am not sure why users should be told namespace limit is X%
worse?

In many use cases, allocating more heap may not be a problem since machines
typically have more memory available. But if you approach it from the
viewpoint of owners of existing hardware that was spec'ed to hold a certain
size of namespace, it can be viewed as a decrease in capacity. I am not saying
it is a showstopper. I just felt it should be given more thought.

I will review the implementation and try to understand your concerns about a
more memory-efficient design.


[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-04-10 Thread Suresh Srinivas (JIRA)

[
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13628197#comment-13628197
]

Suresh Srinivas commented on HDFS-4489:
---

bq. But if you approach from the view point of owners of existing hardware that
was spec'ed to hold certain size of namespace, it can be viewed as a decrease
of capacity.
Again, I do not believe anyone runs with the NN very tightly configured, given
the nature of garbage collection. That said, to make further progress, the
following optimizations can be done:

# Initialize the map only when this feature is enabled. Should take away
roughly 1/3 of extra memory.
# Reuse existing bits in INodeId -
https://issues.apache.org/jira/secure/EditComment!default.jspa?id=12618468&commentId=13508432.
Should take away roughly 1/3 of extra memory.
# Use first block ID of the file (after ensuring even an empty file has an
associated block) as the InodeID. This is very ugly and mixes two abstractions
that should not be mixed. I am reluctant to make this optimization.

My vote is to keep the code simple and the abstractions clean. If folks think
the above optimizations are worth pursuing, I will update the patch.


[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-04-10 Thread Brandon Li (JIRA)

[
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13628217#comment-13628217
]

Brandon Li commented on HDFS-4489:
--

{quote}I am not saying it is a showstopper. I just felt it should be given more
thought. {quote}
In many cases, a trade-off is involved with the introduction of a new feature
or enhancement.
This JIRA was forked from HDFS-4258 and the discussion/experiment has been
going on for more than 4 months.

As shown in the theoretical analysis and experiment results, the memory
overhead of this change is not significant. It doesn't seem worthwhile for now
to complicate NameNode code to do the extra optimizations.


[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-04-09 Thread Daryn Sharp (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13626799#comment-13626799
 ] 

Daryn Sharp commented on HDFS-4489:
---

Maybe something simple like GridMix to get a rough feel for the overhead of the 
extra resolution.  I don't expect it to be much, but it'd be nice to know.


[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-04-09 Thread Kihwal Lee (JIRA)

[
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13626868#comment-13626868
]

Kihwal Lee commented on HDFS-4489:
--

bq. Please look at the overall increase in memory usage instead of increase
over used memory.
Your point would be valid only if the overhead was entirely a fixed amount
(e.g. GSet). Since the extra memory consumption increases as the size of the
namespace grows, factoring the arbitrary max heap size into this can be
misleading. But I agree that the 9% figure does not have an absolute meaning
either. If the inode-to-block ratio is different, the number will be different.
For the clusters I have seen, it will be a lower number. The GSet used for the
InodeID to INode map is also semi-fixed. Is it allocated similarly to
BlocksMap?

In any case, I would not call this insignificant. We have a namenode which will
not work well if we upgrade to a release with this feature, since it will need
an extra 4-6GB for steady-state operation. Even if it could absorb the extra
memory requirement, we would have to tell users that the namespace limit is X%
worse.

Simply saying the overhead is insignificant won't convince users. We should
explain why the benefit from having this feature justifies the overhead. I
don't think an on/off switch is necessary.


[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-04-09 Thread Suresh Srinivas (JIRA)

[
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13626881#comment-13626881
]

Suresh Srinivas commented on HDFS-4489:
---

bq. The GSet used for InodeID to INode map is also semi-fixed. Is it allocated
similarly to BlocksMap?
Yes. Please see the patch in HDFS-4434. About 1% of heap is used for the GSet.
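
To illustrate what a "semi-fixed" table means here, the sketch below sizes a
hash table from a percentage of the maximum heap rather than from the number
of entries. It is a simplified assumption about the general approach, not the
actual LightWeightGSet or HDFS-4434 code; GSetSizing, capacityFor and the
round-down rule are hypothetical.

```java
/** Hypothetical sizing rule: table capacity derived from a heap budget. */
final class GSetSizing {
    /**
     * Returns the largest power-of-two slot count whose reference array fits in
     * the given percentage of the heap (at least 1, capped at 2^30).
     */
    static int capacityFor(long maxHeapBytes, double percent, int referenceBytes) {
        long budgetBytes = (long) (maxHeapBytes * percent / 100.0);
        long slots = budgetBytes / referenceBytes;
        if (slots < 1) {
            return 1;
        }
        long capped = Math.min(slots, 1L << 30);
        return (int) Long.highestOneBit(capped);
    }

    public static void main(String[] args) {
        long heap = 24L << 30; // 24GB heap, as in the measurements on this jira
        System.out.println(capacityFor(heap, 1.0, 8)); // 8-byte references
        System.out.println(capacityFor(heap, 1.0, 4)); // 4-byte (compressed) references
    }
}
```

Under such a rule the table's footprint depends on the configured heap and
stays the same whether the namespace holds one million or one hundred million
inodes, unlike the per-inode id field.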

bq. Simply saying the overhead is insignificant won't convince users. We should
explain why the benefit from having this feature justifies the overhead. I
don't think on/off switch is necessary.
I think the assertion here is not that the overhead is insignificant.
Depending on the details of how the namespace of a system is laid out, I would
think this would be anywhere from 2 to 5%.

As for the benefits, in the main description I laid this out:

---
This helps in several use cases:
# HDFS can evolve to support ID-based protocols such as NFS. We plan to add an
experimental NFS V3 gateway to HDFS using this mechanism. Will post a github
link soon.
# InodeID can be used by tools to track a single instance of a file, for
caching data or tracking and checking for modification based on InodeID, in
tools like distcp.
# Path cannot identify a unique instance of a file. This causes issues as
described in HDFS-4258 and HDFS-4437. It has also been a requirement of many
other jiras such as HDFS-385.
# Using InodeID as an identifier instead of path can be more efficient than
path-based accesses.
---

bq. We have a namenode which will not work well if we upgrade to a release with
this feature since it will need extra 4-6GB for the steady-state operation.
Even if it could absorb the extra memory requirement, we would have to tell
users that the namespace limit is X% worse.
Is this because the namenode does not have the RAM? With this change, it is
expected that the NN is allocated more memory, say 5%. If this is done, I am
not sure why users should be told the namespace limit is X% worse?

My rationale, repeating what I said earlier, is that machines are becoming
available with more RAM. Adding 5% JVM heap should not be a problem. In fact
most of the namenodes are configured with enough head room already and might
not even need a change. But if this is a big concern, I am okay making an
additional change to bring down the memory consumption close to zero.


[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-04-08 Thread Daryn Sharp (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13625918#comment-13625918
 ] 

Daryn Sharp commented on HDFS-4489:
---

I've only skimmed this jira, but a 9% increase is fairly substantial for large 
namespaces.  Are there any performance metrics available?


[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-04-08 Thread Suresh Srinivas (JIRA)

[
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13625969#comment-13625969
]

Suresh Srinivas commented on HDFS-4489:
---

bq. 9% increase is fairly substantial for large namespaces.
Please look at the overall increase in memory usage instead of the increase
over used memory. As I said, that is close to 2.6%.

bq. Are there any performance metrics available?
I do not see much concern here. In fact I removed the flag to turn this feature
on or off. If you think based on the code this is a concern, I could add the
flag back. What metrics would you like to see?


[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-03-29 Thread Suresh Srinivas (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13617328#comment-13617328
 ] 

Suresh Srinivas commented on HDFS-4489:
---

Any further comments? I plan to wrap up HDFS-4334 soon. If there are no further 
concerns, I do not plan on optimizing memory further at the expense of code 
complexity.


[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-03-28 Thread Todd Lipcon (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13616634#comment-13616634
 ] 

Todd Lipcon commented on HDFS-4489:
---

Here are the results from the latest patch:

h2. Setup
- Java 6u31 configured with a 24GB heap (-Xms24g -Xmx24g)
- fsimage is 4.1GB on disk, a snapshot from a mid-size production cluster 
which runs both HBase and some MR workloads.
- 31249022 files and directories, 26525575 blocks = 57774597 total filesystem 
objects.

In each test, I started the NameNode, waited until it had loaded the image and 
opened its IPC port, and then used jmap -histo:live, which issues a full GC 
and reports heap usage statistics.

h2. 2.0.3-beta release
Total heap: 7069MB

Top consumers
{code}
 num     #instances         #bytes  class name
----------------------------------------------
   1:      38421509     2049194112  [Ljava.lang.Object;
   2:      26525179     1485410024  org.apache.hadoop.hdfs.server.blockmanagement.BlockInfo
   3:      19134601     1071537656  org.apache.hadoop.hdfs.server.namenode.INodeFile
   4:      16228949      753517120  [B
   5:      12113580      581451840  org.apache.hadoop.hdfs.server.namenode.INodeDirectory
   6:      19135442      484175352  [Lorg.apache.hadoop.hdfs.server.blockmanagement.BlockInfo;
   7:       1621399      403948560  [I
   8:      11895039      285480936  java.util.ArrayList
   9:             1      268435472  [Lorg.apache.hadoop.hdfs.util.LightWeightGSet$LinkedElement;
{code}

h2. Patched trunk with the map turned off
Total heap: 7528MB (6.5% increase from 2.0.3)

Top consumers
{code}
 num     #instances         #bytes  class name
----------------------------------------------
   1:      38421427     2049187584  [Ljava.lang.Object;
   2:      26525179     1485410024  org.apache.hadoop.hdfs.server.blockmanagement.BlockInfo
   3:      19134601     1377691272  org.apache.hadoop.hdfs.server.namenode.INodeFile
   4:      12113580      775269120  org.apache.hadoop.hdfs.server.namenode.INodeDirectory
   5:      16228690      753509864  [B
   6:      19135442      484175352  [Lorg.apache.hadoop.hdfs.server.blockmanagement.BlockInfo;
   7:       1654298      384726200  [I
   8:      11895040      285480960  java.util.ArrayList
   9:             1      268435472  [Lorg.apache.hadoop.hdfs.util.LightWeightGSet$LinkedElement;
{code}

h2. Patched trunk with the map turned on
Total heap: 7696MB (8.9% increase from 2.0.3)

Top consumers
{code}
 num     #instances         #bytes  class name
----------------------------------------------
   1:      38421429     2049187632  [Ljava.lang.Object;
   2:      26525179     1485410024  org.apache.hadoop.hdfs.server.blockmanagement.BlockInfo
   3:      19134601     1377691272  org.apache.hadoop.hdfs.server.namenode.INodeFile
   4:      12113580      775269120  org.apache.hadoop.hdfs.server.namenode.INodeDirectory
   5:      16228746      753515976  [B
   6:      19135442      484175352  [Lorg.apache.hadoop.hdfs.server.blockmanagement.BlockInfo;
   7:       1499494      426158720  [I
   8:             2      402653216  [Lorg.apache.hadoop.hdfs.util.LightWeightGSet$LinkedElement;
   9:      11895040      285480960  java.util.ArrayList
{code}


I don't think this increased memory is necessarily unacceptable. I just wanted 
to see a true measurement of the overhead instead of hypotheses. It looks like 
the increased memory cost is about twice what was estimated above.
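
The per-object cost can be read straight off the histograms by dividing bytes by instances. A minimal sketch of that arithmetic, using only the INodeFile and INodeDirectory rows quoted above (the class and method names are made up for illustration):

```java
// Per-instance sizes derived from the jmap histogram rows quoted above.
final class HistogramMath {
    // Average bytes per instance for one histogram row.
    static long bytesPerInstance(long totalBytes, long instances) {
        return totalBytes / instances;
    }

    public static void main(String[] args) {
        long files = 19134601L; // INodeFile instances (same in all three runs)
        long dirs = 12113580L;  // INodeDirectory instances (same in all three runs)

        long fileBefore = bytesPerInstance(1071537656L, files); // 2.0.3-beta
        long fileAfter = bytesPerInstance(1377691272L, files);  // patched trunk
        long dirBefore = bytesPerInstance(581451840L, dirs);    // 2.0.3-beta
        long dirAfter = bytesPerInstance(775269120L, dirs);     // patched trunk

        System.out.println("INodeFile bytes/instance: " + fileBefore + " -> " + fileAfter);
        System.out.println("INodeDirectory bytes/instance: " + dirBefore + " -> " + dirAfter);
    }
}
```

Both rows divide evenly, and both classes grow by the same number of bytes per instance between the 2.0.3-beta run and the patched runs.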


[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-03-28 Thread Suresh Srinivas (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13616645#comment-13616645
 ] 

Suresh Srinivas commented on HDFS-4489:
---

[~tlipcon] Thanks for running the tests.

I personally am not concerned about this increased memory. If there are others 
with concerns, I can try reducing memory consumption further at the expense 
of more complex code. Thoughts?


[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-03-28 Thread Suresh Srinivas (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13616652#comment-13616652
 ] 

Suresh Srinivas commented on HDFS-4489:
---

BTW my calculation of the increased memory is against the total Java heap 
allocated to the process rather than the memory used in the old generation 
alone. That is a better way to quantify the impact on users, right?


[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-03-28 Thread Suresh Srinivas (JIRA)

[
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13616658#comment-13616658
]

Suresh Srinivas commented on HDFS-4489:
---

bq. BTW my calculation of the increased memory is against the total Java heap
allocated to the process rather than the memory used in the old generation
alone. That is a better way to quantify the impact on users, right?

Sorry, my previous comments may not have been clear to everyone. The increase
of 627MB, from 7069MB to 7696MB, is 8.9% of the used heap. The way I was
quantifying it was as a percentage of the entire Java heap: 627MB out of 24GB
is about 2.6%.
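
The two percentages differ only in the denominator. A small sketch of the arithmetic, assuming the 24G heap is 24 * 1024 MB and taking the two totals from the measurements in this thread (the class and method names are illustrative only):

```java
// Same absolute increase, expressed against two different baselines.
final class HeapIncrease {
    // Increase from 'before' to 'after', as a percentage of 'base'.
    static double percentOf(double before, double after, double base) {
        return (after - before) / base * 100.0;
    }

    public static void main(String[] args) {
        double usedBefore = 7069;     // MB, 2.0.3-beta total heap in use
        double usedAfter = 7696;      // MB, patched trunk with the map on
        double totalHeap = 24 * 1024; // MB, assuming -Xmx24g means 24 * 1024 MB

        double vsUsed = percentOf(usedBefore, usedAfter, usedBefore);
        double vsTotal = percentOf(usedBefore, usedAfter, totalHeap);
        System.out.println("relative to used heap:  " + vsUsed + "%");
        System.out.println("relative to total heap: " + vsTotal + "%");
    }
}
```

Which number matters depends on the question being asked: the first describes growth of the live data set, the second describes how much of the configured heap the growth consumes.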


[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

[
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13615575#comment-13615575
]

Todd Lipcon commented on HDFS-4489:
---

bq. byte[] name - I assume typically ~56 bytes for this. That is (16 bytes
object overhead, 8 byte length + bytes that make up file name, say 32)

According to your comment here:
https://issues.apache.org/jira/browse/HDFS-1110?focusedCommentId=12861548&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12861548
a typical image with ~50M files will only need ~5M unique name byte[] objects,
so I think it's unfair to count the above against the inode.

I think you're also adding an extra 8 bytes on the arrays -- the array length
as I understand it is a field within the 16byte object header (occupying the
second half of the klassId field).

Regardless, this seems like something that's very easy to test rather than try
to solve analytically. Do you have results for the additional memory overhead
of this map on a large production image? If it's truly 3-5%, that seems
reasonable, but I'm afraid it may look closer to 10+% in practice.


[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-03-27 Thread Suresh Srinivas (JIRA)

[
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13615665#comment-13615665
]

Suresh Srinivas commented on HDFS-4489:
---

bq. I think you're also adding an extra 8 bytes on the arrays – the array
length as I understand it is a field within the 16byte object header (occupying
the second half of the klassId field).
If you have an authoritative source, please send me that. I cannot understand
how a 16 byte object header would have a spare, say, 8 bytes to track the
array length. Some of my previous instrumentation had led me to conclude that
the array length is 4 bytes for a 32-bit JVM and 8 bytes for a 64-bit JVM. See
the discussion here -
http://www.javamex.com/tutorials/memory/object_memory_usage.shtml.

bq. a typical image with ~50M files will only need ~5M unique name byte[]
objects, so I think it's unfair to count the above against the inode.
That is a fair point. But my own assumption that inodes occupy 1/3rd of the
Java heap is also an approximation, and in practice I would think inodes
occupy less than that.

I would like to run an experiment on a large production image. But I do not
have ready access to it and will have to spend time getting to it. Do you have
any?

bq. but I'm afraid it may look closer to 10+% in practice.
I do not think it will be close to 10%, but let's say it is. I do not see much
of an issue with it. When we did some of the optimizations earlier, we were not
sure how the JVM would do if the heap went close to 64G and hence wanted to
keep the heap size down. But since then many large installations have
successfully gone beyond that size without any issues. Smaller installations
should be able to spare, say, 10% extra heap. But if that is not acceptable,
here are the alternatives I see:
# Add configuration options to turn this feature off. Not instantiating GSet
will reduce the overhead by 1/3rd. This is simple to do.
# Make more optimizations at the expense of code complexity. I would like to
avoid this. But if it is deemed very important, with some optimizations, we can
get it close to 0%.


[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

[
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13615677#comment-13615677
]

Todd Lipcon commented on HDFS-4489:
---

bq. If you have an authoritative source, please send me that

Sure, from the JDK7 source code hotspot/src/share/vm/oops/arrayOop.hpp:

{code}
// The layout of array Oops is:
//
// markOop
// klassOop  // 32 bits if compressed but declared 64 in LP64.
// length    // shares klass memory or allocated after declared fields.
{code}

Important to note that the length of arrays is 32-bit, since array.length is an
int rather than a long. So given a 64-bit field for klassId, it can use 32-bits
for the actual class and 32 bits for the array length.

bq. I would like to run an experiment on a large production image. But I do not
have ready access to it and will have to spend time getting to it. Do you have
any?

Yes, I can run the experiment on a large image. Is HDFS-4434's patch ready to
apply so I can test it?


[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-03-27 Thread Suresh Srinivas (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13615709#comment-13615709
 ] 

Suresh Srinivas commented on HDFS-4489:
---

bq. or allocated after declared fields.
Not sure what this means though.

HDFS-4434 patch is ready. Thanks in advance for running the tests.


[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

[
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13615749#comment-13615749
]

Todd Lipcon commented on HDFS-4489:
---

{code}
  // The _length field is not declared in C++. It is allocated after the
  // declared nonstatic fields in arrayOopDesc if not compressed, otherwise
  // it occupies the second half of the _klass field in oopDesc.
  static int length_offset_in_bytes() {
    return UseCompressedOops ? klass_gap_offset_in_bytes() :
                               sizeof(arrayOopDesc);
  }
{code}

Basically if CompressedOops are on, then klass IDs are only 32 bits, but
there's already a 64-bit field for it, so it just uses the latter 4 bytes for
the array length. Otherwise it's an extra 4 bytes that comes after the standard
oop header (oopDesc). So, without compressed oops, arrays take 20 bytes base.
With them (on by default on heaps <32GB since 6u18 I believe), the array header
is the same size as normal objects (16 bytes).

Will take a look at loading a big image with that patch now.


[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-03-27 Thread Suresh Srinivas (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13615755#comment-13615755
 ] 

Suresh Srinivas commented on HDFS-4489:
---

Ran some quick tests for object sizes, using 
https://github.com/dweiss/java-sizeof (pretty neat stuff!)
{code}
  public static void main(String[] args) {
    System.out.println(RamUsageEstimator.sizeOf(new Object()));
    System.out.println(RamUsageEstimator.sizeOf(new Object[0]));
    System.out.println(RamUsageEstimator.sizeOf(new Object[100]));
  }
{code}

With compressed oops on I get:
16
16
416

After turning it off:
16
24
824



[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs


[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13615760#comment-13615760
 ] 

Todd Lipcon commented on HDFS-4489:
---

Neat. I'm setting up those tests now... taking a while to clone/build hadoop 
onto the right machine that has enough RAM.


[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-03-26 Thread Todd Lipcon (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13614692#comment-13614692
 ] 

Todd Lipcon commented on HDFS-4489:
---

bq. Inode size is ~180 bytes and this proposal adds 16-24 bytes per Inode.

How is this calculated? I see the following 5 fields:

{code}
  private byte[] name = null;
  private long permission = 0L;
  protected INodeDirectory parent = null;
  protected long modificationTime = 0L;
  protected long accessTime = 0L;
{code}

for a total of 40 bytes on a 64-bit JVM. So, adding 16-24 bytes is a pretty 
substantial new memory use.

I agree with ATM that this should go on a branch since it's fairly invasive. 
Once the branch is working, we can evaluate the benefit of the new feature vs 
the measured cost (both memory and additional CPU to manage this new structure)


[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-03-26 Thread Todd Lipcon (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13614694#comment-13614694
 ] 

Todd Lipcon commented on HDFS-4489:
---

(I guess I should add the subclass fields, in which case INodeFile has another 
two 8-byte fields, plus the associated array object for BlockInfo, etc)... but 
still seems to come in a lot less than 180 bytes.


[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-03-26 Thread Suresh Srinivas (JIRA)

[
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13614721#comment-13614721
]

Suresh Srinivas commented on HDFS-4489:
---

bq. for a total of 40 bytes on a 64-bit JVM. So, adding 16-24 bytes is a pretty
substantial new memory use.
Here are the things that go into the ~180 bytes:
INode is an object. It comes with the cost of 16 bytes object header overhead.
Members include:
# byte[] name - I assume typically ~56 bytes for this. That is (16 bytes object
overhead, 8 byte length + bytes that make up file name, say 32)
# reference to byte[] name - 8 bytes
# long permission at the cost of 8 bytes.
# parent reference at 8 bytes cost
# modification time at 8 bytes cost
# accessTime at 8 bytes cost

That is roughly ~112 bytes.

Typically most of the INodes are INode files (I will leave the other type of
inodes as an exercise).
# It has BlockInfo[]. This is again 16 bytes of object, 8 bytes length, say two
blocks in a file with two references, with a cost of 40 bytes.
# It has long header that adds another 8 bytes.

Total ~160 bytes. So it is not very far off and the number I had posted was
based on what I had calculated long back.

That said, 16-24 bytes might seem like a huge percentage (10 to 15%) of the
INode size. But how much of the NN heap is allocated to inodes? Assuming inodes
make up 1/3, blocks make up another 1/3, and the remaining 1/3 is floating
garbage, headroom etc., the net impact on the NN heap is 3 to 5%. That is not
far off from the analysis posted above.
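
The estimate in this comment can be written out as a short calculation. This sketch takes the figures exactly as stated here (16-byte object header, 8-byte references and longs, a separate 8-byte array length, a 32-byte name, two blocks per file, inodes at 1/3 of the heap); the separate array length in particular is an assumption that is questioned elsewhere in the thread, and all names below are illustrative:

```java
// Back-of-the-envelope INode size estimate, using the assumptions in this comment.
final class INodeEstimate {
    static final int OBJECT_HEADER = 16; // per-object overhead assumed in the comment
    static final int REFERENCE = 8;      // reference size assumed in the comment
    static final int LONG = 8;
    static final int ARRAY_LENGTH = 8;   // assumed separate length field (disputed)

    // INode object plus the byte[] holding its name.
    static int baseINodeBytes(int nameBytes) {
        int nameArray = OBJECT_HEADER + ARRAY_LENGTH + nameBytes; // byte[] name
        return OBJECT_HEADER
            + nameArray
            + REFERENCE // reference to name
            + LONG      // permission
            + REFERENCE // parent
            + LONG      // modificationTime
            + LONG;     // accessTime
    }

    // INodeFile adds a BlockInfo[] and a long header.
    static int inodeFileBytes(int nameBytes, int blocks) {
        int blockArray = OBJECT_HEADER + ARRAY_LENGTH + blocks * REFERENCE;
        return baseINodeBytes(nameBytes) + blockArray + LONG;
    }

    // Heap impact in percent if inodes make up 'inodeShare' of the heap.
    static double heapImpactPercent(int addedBytes, int inodeBytes, double inodeShare) {
        return 100.0 * addedBytes / inodeBytes * inodeShare;
    }

    public static void main(String[] args) {
        int inode = inodeFileBytes(32, 2);
        System.out.println("base INode bytes: " + baseINodeBytes(32));
        System.out.println("INodeFile bytes:  " + inode);
        System.out.println("impact for +16 bytes: " + heapImpactPercent(16, inode, 1.0 / 3) + "%");
        System.out.println("impact for +24 bytes: " + heapImpactPercent(24, inode, 1.0 / 3) + "%");
    }
}
```

Changing any of the constants (for example dropping the separate array length, or sharing name arrays between inodes) shrinks the per-inode total and therefore raises the relative cost of the added bytes.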

I believe half of the work is already in trunk. Remaining two jiras need to go
in. I believe doing a branch at this point in time is unnecessary work.

If you are concerned about memory usage of your installs, I can add a config
option and not instantiate the map.


[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

2013-03-26 Thread Suresh Srinivas (JIRA)

[
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13614802#comment-13614802
]

Suresh Srinivas commented on HDFS-4489:
---

One more thing I forgot to add. There are many optimizations that are possible
to reduce the memory consumed. They come at the cost of code complexity and
less clean abstractions. Given that newer machines are coming with more memory,
I would rather spend the additional memory than make the code unclean.


[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs


[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13613236#comment-13613236
 ] 

Suresh Srinivas commented on HDFS-4489:
---

Given that the subtasks do not break the trunk, I plan to start reviewing 
individual jiras and committing the patches attached to subtasks. Some of these 
patches are already committed to trunk.


[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs


[ 
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13613240#comment-13613240
 ] 

Aaron T. Myers commented on HDFS-4489:
--

Why not do this on a branch? That makes the most sense to me, given that the 
individual patches themselves don't make a lot of sense when considered 
individually.


[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

[
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13613261#comment-13613261
]

Suresh Srinivas commented on HDFS-4489:
---

[~atm] Can you describe which of the individual patches do not make sense to
you? I thought the previous comments indicated that it was not clear what the
overall design is and how the pieces fit together. Now that this jira
describes the overall motivation and the approach being taken, I hope there is
more clarity.


[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

[
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13613269#comment-13613269
]

Aaron T. Myers commented on HDFS-4489:
--

Sorry if I wasn't clear - all the patches make sense to me, it's just that
several of them don't really stand on their own, so it seems like we should
do the whole work on a separate branch, get the feature in shape there,
and then merge it back to trunk once the whole project is completed.


[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

[
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13613293#comment-13613293
]

Suresh Srinivas commented on HDFS-4489:
---

bq. all the patches make sense to me, it's just that several of them don't
really stand on their own
I am not sure I understand this. One of the main reasons for a feature branch
(at least for me) is that we may break trunk during development. But that is
not the case here.

I have cleaned up the list of subtasks in this jira. Hopefully the subtasks
make it clearer.

Let me add some details about the individual jiras, which should help in
understanding them better:
# HDFS-4334 - Adds a unique ID to each INode.
# HDFS-4346 - Refactors the code to remove duplication between INode ID
generation and block ID generation.
# HDFS-4339 - Persists the INode ID in the fsimage.
# HDFS-4434 - Introduces a map of inode ID to inode so that the inodeid/fileid
can be used as an identifier to address a file.
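
As a rough illustration of how these four pieces fit together, here is a minimal sketch. It is not the HDFS code: every class and method name below (InodeIdSketch, SequentialId, Namespace, lastAllocatedId) is hypothetical, and seeding the generator from a saved value is only an assumed stand-in for what persisting the ID in the fsimage makes possible.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch only; none of these are the real HDFS classes. */
final class InodeIdSketch {

  /** A file or directory that carries a unique ID (the idea in HDFS-4334). */
  static final class Inode {
    final long id;   // never changes, even if the path does
    String name;     // may change on rename

    Inode(long id, String name) {
      this.id = id;
      this.name = name;
    }
  }

  /** Sequential number generator, the kind of helper that can be shared (HDFS-4346). */
  static final class SequentialId {
    private final AtomicLong current;

    SequentialId(long initial) {
      this.current = new AtomicLong(initial);
    }

    long next() {
      return current.incrementAndGet();
    }

    long current() {
      return current.get();
    }
  }

  /** Keeps an inode ID to inode map so a file can be addressed by ID (HDFS-4434). */
  static final class Namespace {
    private final SequentialId ids;
    private final Map<Long, Inode> byId = new HashMap<>();

    /** Assumption: the last allocated ID is saved and restored across restarts (HDFS-4339). */
    Namespace(long lastPersistedId) {
      this.ids = new SequentialId(lastPersistedId);
    }

    Inode create(String name) {
      Inode inode = new Inode(ids.next(), name);
      byId.put(inode.id, inode);
      return inode;
    }

    /** Returns the inode with this ID, or null if there is none. */
    Inode lookup(long id) {
      return byId.get(id);
    }

    void rename(long id, String newName) {
      Inode inode = byId.get(id);
      if (inode == null) {
        throw new IllegalArgumentException("No inode with id " + id);
      }
      inode.name = newName;
    }

    long lastAllocatedId() {
      return ids.current();
    }
  }
}
```

In this sketch, a caller that holds an ID keeps addressing the same file after a rename, and a Namespace constructed from the previous lastAllocatedId() does not hand out an ID that was already used.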

[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

[
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13613338#comment-13613338
]

Suresh Srinivas commented on HDFS-4489:
---

bq. At one point in time I believe the INodeID work did indeed break WebHDFS on
trunk.
That was because of a bug introduced in the change. When I said *break* in my
previous comment, I meant breaking functionality because an incomplete set of
changes leaves HDFS non-functional, not a bug in the committed code.

bq. I don't understand the resistance to doing this on a feature branch. What's
the concern with doing so?
I am not resisting it; I just do not see a need for it. I believe we have two
more jiras to go in; the others are already in. I think moving those commits to
a separate branch, adding a mere two more commits to that branch, and calling
for a merge vote is an unnecessary waste of time, and I want to avoid it.

[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs

[
https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13613300#comment-13613300
]

Aaron T. Myers commented on HDFS-4489:
--

bq. One of the main reasons for a feature branch (at least for me) is that,
during development, we may break trunk. That is not the case here.

At one point in time I believe the INodeID work did indeed break WebHDFS on
trunk.

Another reason for using a development branch is that the feature isn't
necessarily complete until certain patches have been committed. The fact
that HDFS-4339 (persist INodeIDs in the fsimage) isn't committed yet suggests
that this feature won't really work as intended until that's committed, and yet
we've already committed other patches involving INodeIDs to trunk. That doesn't
make much sense to me.

I don't understand the resistance to doing this on a feature branch. What's the
concern with doing so?

[jira] [Commented] (HDFS-4489) Use InodeID as as an identifier of a file in HDFS protocols and APIs