[ https://issues.apache.org/jira/browse/HDFS-6709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075894#comment-14075894 ]

Colin Patrick McCabe commented on HDFS-6709:
--------------------------------------------

bq. I'm just asking leading questions to make sure this approach is sound. Y! 
stands to lose a lot if this doesn't actually scale

The questions are good... hopefully the answers are too!  I'm just trying to 
make my answers as complete as I can.

bq. To clarify the RTTI, I thought you meant more than just a per-instance 
reference to the class would be saved - although saving a reference is indeed 
great

Yeah.  It will shrink objects by 4 or 8 bytes each.  It's not immaterial!  
Savings like these are why I think it will shrink memory consumption.
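
To make that concrete, here's a rough sketch of the kind of layout I mean (not 
from the attached patch; the names, field set, and offsets are made up for 
illustration): fixed-width records packed into a direct ByteBuffer and 
addressed by offset, so there is no per-instance mark word or klass pointer to 
pay for.

{code}
import java.nio.ByteBuffer;

// Illustrative only: a hypothetical fixed-width off-heap record.  Each
// "inode" occupies RECORD_SIZE bytes in one big buffer, so the per-object
// header (mark word + klass pointer) is not paid for every instance.
class OffHeapInodeTable {
  private static final int MTIME_OFFSET = 0;   // 8-byte long
  private static final int ATIME_OFFSET = 8;   // 8-byte long
  private static final int PERM_OFFSET  = 16;  // 2-byte short
  private static final int RECORD_SIZE  = 18;  // packed, no padding

  private final ByteBuffer buf;

  OffHeapInodeTable(int maxInodes) {
    this.buf = ByteBuffer.allocateDirect(maxInodes * RECORD_SIZE);
  }

  long getMtime(int inodeId) {
    return buf.getLong(inodeId * RECORD_SIZE + MTIME_OFFSET);
  }

  void setMtime(int inodeId, long mtime) {
    buf.putLong(inodeId * RECORD_SIZE + MTIME_OFFSET, mtime);
  }
}
{code}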

bq. Regarding atomicity/CAS, it's relevant because using misalignment 
(over-optimization?) prevents adding concurrency to data structures that aren't 
but should allow concurrency. I digress....

Isn't this a minor implementation detail, though?  We don't currently use 
atomic ops on these data structures.  If we go ahead with a layout that uses 
unaligned access, and someone later decides to make things atomic, we can 
always switch to an aligned layout.
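
In other words, the alignment decision is confined to the offset constants.  A 
hypothetical sketch of the two layouts (again, made-up names and offsets, not 
from the patch):

{code}
// Hypothetical offset tables for one record; only the constants differ.
final class PackedLayout {               // unaligned: smallest footprint
  static final int PERM_OFFSET  = 0;     // 2-byte short
  static final int MTIME_OFFSET = 2;     // 8-byte long, not 8-byte aligned
  static final int RECORD_SIZE  = 10;
}

final class AlignedLayout {              // pad longs to 8-byte boundaries
  static final int PERM_OFFSET  = 0;
  static final int MTIME_OFFSET = 8;     // now eligible for atomic access
  static final int RECORD_SIZE  = 16;
}
{code}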

bq. I know about generational collection but I'm admittedly not an expert. 
Which young gen GC method does not pause? ParNew+CMS definitively pauses... 
Here are some quickly gathered 12-day observations from a moderately loaded, 
multi-thousand node, non-production cluster:

I'm not a GC expert either.  But from what I've read, "does not pause" is a 
pretty high bar to clear.  I think even Azul's GC pauses on occasion for 
sub-millisecond intervals.  For CMS and G1, everything I've read talks about 
tuning the young-gen collection in terms of target pause times.
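
For reference, the tuning I'm talking about is expressed either as a pause-time 
goal (G1) or as young-gen sizing (ParNew+CMS) on the JVM command line; the 
specific values below are just examples, not a recommendation:

{code}
# G1: ask the collector to aim for a maximum pause (it can still miss it)
-XX:+UseG1GC -XX:MaxGCPauseMillis=200

# ParNew+CMS: young-gen pause length is controlled indirectly by sizing
# the young generation and survivor spaces
-XX:+UseParNewGC -XX:+UseConcMarkSweepGC -Xmn4g -XX:SurvivorRatio=8
{code}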

bq. We have production clusters over 2.5X larger that sustained over 3X 
ops/sec. This non-prod cluster is generating ~625MB of garbage/sec. How do you 
predict dynamic instantiation of INode and BlockInfo objects will help? They 
generally won't be promoted to old gen which should reduce the infrequent CMS 
collection times. BUT, will it dramatically increase the frequency of young 
collection and/or lead to premature tenuring?

If you look at the code, we create temporary objects all over the place.

For example, look at setTimes:

{code}
  private void setTimesInt(String src, long mtime, long atime)
    throws IOException, UnresolvedLinkException {
    HdfsFileStatus resultingStat = null;
    FSPermissionChecker pc = getPermissionChecker();
    checkOperation(OperationCategory.WRITE);
    byte[][] pathComponents = FSDirectory.getPathComponentsForReservedPath(src);
    writeLock();
    try {
      checkOperation(OperationCategory.WRITE);
      checkNameNodeSafeMode("Cannot set times " + src);
      src = FSDirectory.resolvePath(src, pathComponents, dir);

      // Write access is required to set access and modification times
      if (isPermissionEnabled) {
        checkPathAccess(pc, src, FsAction.WRITE);
      }
      final INodesInPath iip = dir.getINodesInPath4Write(src);
      final INode inode = iip.getLastINode();
{code}

You can see we create:
* HdfsFileStatus (with at least 5 sub-objects; one of those, FsPermission, has 3 sub-objects of its own)
* FSPermissionChecker (which has at least 5 sub-objects inside it)
* pathComponents
* a new src string
* INodesInPath (with at least 2 sub-objects of its own)

That's at least 21 temporary objects just in this code snippet, and I'm sure I 
missed a lot of things.  I'm not including any of the functions that called or 
were called by this function, or any of the RPC or protobuf machinations.  The 
average path depth is maybe between 5 and 8... would having 5 to 8 extra 
temporary objects to represent INodes we traversed substantially increase the 
GC load?  I would say no.
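
To be concrete about what those 5 to 8 extra objects would look like: the idea 
is a thin, short-lived wrapper per path component that just carries an id into 
the off-heap store, roughly like this (hypothetical names, building on the 
OffHeapInodeTable sketch above, not the actual patch):

{code}
// Hypothetical flyweight: one small, short-lived object per path component
// traversed.  It holds only an id into the off-heap table, so it is
// collected in the young generation almost immediately.
class INodeRef {
  private final OffHeapInodeTable table;
  private final int inodeId;

  INodeRef(OffHeapInodeTable table, int inodeId) {
    this.table = table;
    this.inodeId = inodeId;
  }

  long getModificationTime() {
    return table.getMtime(inodeId);
  }
}
{code}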

Maybe you think I've chosen an easy example.  Hmm... the operation that I can 
think of that touches the most inodes is recursive delete.  But we've known 
about the problems with this for a while... that's why JIRAs like HDFS-2938 
addressed the problem.  Arguably, an off-heap implementation is actually better 
here since we avoid creating a lot of trash in the tenured generation.  And 
trash in the tenured generation leads to heap fragmentation (at least in CMS), 
and the dreaded full GC.

> Implement off-heap data structures for NameNode and other HDFS memory 
> optimization
> ----------------------------------------------------------------------------------
>
>                 Key: HDFS-6709
>                 URL: https://issues.apache.org/jira/browse/HDFS-6709
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>            Reporter: Colin Patrick McCabe
>            Assignee: Colin Patrick McCabe
>         Attachments: HDFS-6709.001.patch
>
>
> We should investigate implementing off-heap data structures for NameNode and 
> other HDFS memory optimization.  These data structures could reduce latency 
> by avoiding the long GC times that occur with large Java heaps.  We could 
> also avoid per-object memory overheads and control memory layout a little bit 
> better.  This also would allow us to use the JVM's "compressed oops" 
> optimization even with really large namespaces, if we could get the Java heap 
> below 32 GB for those cases.  This would provide another performance and 
> memory efficiency boost.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
