[jira] [Commented] (HDFS-6709) Implement off-heap data structures for NameNode and other HDFS memory optimization

2014-07-28 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14076272#comment-14076272
 ] 

Daryn Sharp commented on HDFS-6709:
---

Yes, we definitely generate a lot of garbage per call.  Due to GC concerns, 
I've got work in progress to reduce the garbage generated, which is why I'm 
concerned about even more garbage per call (inodes are repeatedly looked up far 
more often than you'd think; I'm working on single resolution).  We've already 
tuned the young generation to be in line with what other companies running 
large-scale services use.

bq. Maybe you think I've chosen an easy example. Hmm... the operation that I 
can think of that touches the most inodes is recursive delete.

Yes, deleting a large tree is a good example, but in practice it's a rare 
operation.  However, getContentSummary is run often for monitoring.  It may 
take many seconds on just a subtree of some clusters, and it may visit millions 
or tens of millions of inodes.

Many block-level operations fetch the block collection, which is really the 
inode, sometimes to verify the block isn't abandoned or to access other 
related blocks.  Decommissioning has always been a problem in general: it will 
repeatedly crawl hundreds of thousands of blocks, each requiring a BC/inode 
lookup.  The replication monitor is likely to indirectly require the BC/inode 
too when it runs every 3s.  Refer to {{BlocksMap.getBlockCollection}} to see 
how many other places it's called from.

Even the unobtainable best case of eliminating the 1.5s-per-6h CMS pause at 
the expense of increasing the frequency and/or duration of the ParNew pauses 
would be a huge loss.  I suppose the proof is in a simulation.  Perhaps a 
rudimentary test is instantiating a garbage inode and blockinfo every time one 
is looked up, yet still returning the real one, so we can see how well ParNew 
handles the onslaught of garbage.
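Something like the following rough sketch could drive that test (class and 
names invented here, not actual HDFS code): allocate a throwaway object on 
every lookup while still returning the real one, then compare ParNew frequency 
and duration in the GC logs ({{-verbose:gc -XX:+PrintGCDetails}}).

{code}
// Hypothetical probe for the experiment described above.  Wrap the real
// lookup path, then compare GC log output with and without the extra
// allocations.
final class GarbagePressureProbe<T> {
  private volatile Object sink;   // volatile store keeps the allocation from
                                  // being eliminated by escape analysis

  T lookupWithGarbage(T real) {
    sink = new long[8];           // ~64-byte stand-in for a temporary INode/BlockInfo
    return real;                  // callers still operate on the canonical object
  }
}
{code}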

 Implement off-heap data structures for NameNode and other HDFS memory 
 optimization
 --

 Key: HDFS-6709
 URL: https://issues.apache.org/jira/browse/HDFS-6709
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: namenode
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
 Attachments: HDFS-6709.001.patch


 We should investigate implementing off-heap data structures for NameNode and 
 other HDFS memory optimization.  These data structures could reduce latency 
 by avoiding the long GC times that occur with large Java heaps.  We could 
 also avoid per-object memory overheads and control memory layout a little bit 
 better.  This also would allow us to use the JVM's compressed oops 
 optimization even with really large namespaces, if we could get the Java heap 
 below 32 GB for those cases.  This would provide another performance and 
 memory efficiency boost.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HDFS-6709) Implement off-heap data structures for NameNode and other HDFS memory optimization

2014-07-27 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075894#comment-14075894
 ] 

Colin Patrick McCabe commented on HDFS-6709:


bq. I'm just asking leading questions to make sure this approach is sound. Y! 
stands to lose a lot if this doesn't actually scale

The questions are good... hopefully the answers are too!  I'm just trying to 
make my answers as complete as I can.

bq. To clarify the RTTI, I thought you meant more than just a per-instance 
reference to the class would be saved - although saving a reference is indeed 
great

Yeah.  It will shrink objects by 4 or 8 bytes each.  That's not immaterial!  
Savings like these are why I think it will shrink memory consumption.
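For a concrete look at where those 4 or 8 bytes live, here is a hedged 
illustration using the OpenJDK JOL tool (assuming jol-core is on the 
classpath): it prints each instance's header, including the per-instance class 
pointer that compressed class pointers shrink from 8 bytes to 4.

{code}
import org.openjdk.jol.info.ClassLayout;

public class HeaderDemo {
  public static void main(String[] args) {
    // On 64-bit HotSpot: 8-byte mark word plus a 4- or 8-byte class pointer,
    // depending on whether compressed class pointers are in effect.
    System.out.println(ClassLayout.parseClass(Object.class).toPrintable());
  }
}
{code}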

bq. Regarding atomicity/CAS, it's relevant because using misalignment 
(over-optimization?) prevents adding concurrency to data structures that 
aren't concurrent but should allow concurrency. I digress...

Isn't this a minor implementation detail, though?  We don't currently use 
atomic ops on these data structures.  If we go ahead with a layout that uses 
unaligned access, and someone later decides to make things atomic, we can 
always switch to an aligned layout.

bq. I know about generational collection but I'm admittedly not an expert. 
Which young gen GC method does not pause? ParNew+CMS definitively pauses... 
Here are some quickly gathered 12-day observations from a moderately loaded, 
multi-thousand node, non-production cluster:

I'm not a GC expert either.  But from what I've read, "does not pause" is a 
pretty high bar to clear.  I think even Azul's GC pauses on occasion for 
sub-millisecond intervals.  For CMS and G1, everything I've read talks about 
tuning the young-gen collection in terms of target pause times.

bq. We have production clusters over 2.5X larger that sustained over 3X 
ops/sec. This non-prod cluster is generating ~625MB of garbage/sec. How do you 
predict dynamic instantiation of INode and BlockInfo objects will help? They 
generally won't be promoted to old gen which should reduce the infrequent CMS 
collection times. BUT, will it dramatically increase the frequency of young 
collection and/or lead to premature tenuring?

If you look at the code, we create temporary objects all over the place.

For example, look at setTimes:

{code}
  private void setTimesInt(String src, long mtime, long atime)
      throws IOException, UnresolvedLinkException {
    HdfsFileStatus resultingStat = null;
    FSPermissionChecker pc = getPermissionChecker();
    checkOperation(OperationCategory.WRITE);
    byte[][] pathComponents = FSDirectory.getPathComponentsForReservedPath(src);
    writeLock();
    try {
      checkOperation(OperationCategory.WRITE);
      checkNameNodeSafeMode("Cannot set times " + src);
      src = FSDirectory.resolvePath(src, pathComponents, dir);

      // Write access is required to set access and modification times
      if (isPermissionEnabled) {
        checkPathAccess(pc, src, FsAction.WRITE);
      }
      final INodesInPath iip = dir.getINodesInPath4Write(src);
      final INode inode = iip.getLastINode();
{code}

You can see we create:
* HdfsFileStatus (with at least 5 sub-objects; one of those, FsPermission, has 3 sub-objects of its own)
* FSPermissionChecker (which has at least 5 sub-objects inside it)
* pathComponents
* a new src string
* INodesInPath (at least 2 sub-objects of its own)

That's at least 21 temporary objects just in this code snippet, and I'm sure I 
missed a lot of things.  I'm not including any of the functions that called or 
were called by this function, or any of the RPC or protobuf machinations.  The 
average path depth is maybe between 5 and 8... would having 5 to 8 extra 
temporary objects to represent INodes we traversed substantially increase the 
GC load?  I would say no.

Maybe you think I've chosen an easy example.  Hmm... the operation that I can 
think of that touches the most inodes is recursive delete.  But we've known 
about the problems with this for a while... that's why JIRAs like HDFS-2938 
addressed the problem.  Arguably, an off-heap implementation is actually better 
here, since we avoid creating a lot of trash in the tenured generation.  And 
trash in the tenured generation leads to heap fragmentation (at least in CMS) 
and the dreaded full GC.

[jira] [Commented] (HDFS-6709) Implement off-heap data structures for NameNode and other HDFS memory optimization

2014-07-25 Thread Kai Zheng (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074189#comment-14074189
 ] 

Kai Zheng commented on HDFS-6709:
-

I repeated the test in the post and sadly found it's true that DirectByteBuffer 
reads don't perform as well as writes.  I'm communicating with Oracle about 
this; hopefully they can explain it and address it in a future Java version.  
It's interesting.  Thanks.



[jira] [Commented] (HDFS-6709) Implement off-heap data structures for NameNode and other HDFS memory optimization

2014-07-25 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075157#comment-14075157
 ] 

Daryn Sharp commented on HDFS-6709:
---

I sense a bit of condescension.  We're all friends here.  I'm just asking 
leading questions to make sure this approach is sound.  Y! stands to lose a lot 
if this doesn't actually scale.  To clarify the RTTI, I thought you meant more 
than just a per-instance reference to the class would be saved - although 
saving a reference is indeed great.  Regarding atomicity/CAS, it's relevant 
because using misalignment (over-optimization?) prevents adding concurrency to 
data structures that aren't concurrent but should allow concurrency.  I 
digress...

The important point for discussion is this:
bq. No, because every modern GC uses generational collection. This means that 
short-lived instances are quickly cleaned up, without any pauses.

I know about generational collection but I'm admittedly not an expert.  Which 
young gen GC method does not pause?  ParNew+CMS definitively pauses...  Here 
are some quickly gathered 12-day observations from a moderately loaded, 
multi-thousand node, non-production cluster:
* ParNew collections:
** frequency: every ~4s
** range: 66ms - 468ms
** avg: 130ms
** collects: 2.5GB
* CMS old collections:
** frequency:  4 per day 
** range: 1.0s - 2.9s
** avg: 1.5s

We have production clusters over 2.5X larger that sustained over 3X ops/sec.  
This non-prod cluster is generating ~625MB of garbage/sec.  How do you predict 
dynamic instantiation of INode and BlockInfo objects will help?  They generally 
won't be promoted to old gen which should reduce the infrequent CMS collection 
times.  BUT, will it dramatically increase the frequency of young collection 
and/or lead to premature tenuring?

I'm honestly asking for design specifics of how this will help.  Are you 
positive we won't get a little by giving up a lot?  I'd love for my concerns to 
be unfounded.



[jira] [Commented] (HDFS-6709) Implement off-heap data structures for NameNode and other HDFS memory optimization

2014-07-24 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073404#comment-14073404
 ] 

Daryn Sharp commented on HDFS-6709:
---

Questions/comments on the advantages:
* I thought RTTI is per class, not instance?  If yes, the savings are 
immaterial?
* Using misaligned access may result in processor incompatibility, impact 
performance, and introduce atomicity and CAS problems; concurrent access to 
adjacent misaligned memory in the same cache line may be completely unsafe.
* Only primitives, not references, can be stored off-heap, so how do value 
types (non-boxed primitives, correct?) apply?  Wouldn't the instance managing 
the slab have methods that return the correct primitive?

I think off-heap may be a win in some limited cases, but I'm struggling with 
how it will work in practice.  Here are some thoughts for clarification on the 
actual application of the technique:
# OO encapsulation and polymorphism are lost?
# We can't store references anymore, so we're reduced to primitives?
# Let's say we used to have a class {{Foo}} with instance fields 
{{field1..field4}} of various types.  {{FooManager.get(id)}} returns a {{Foo}} 
instance.  But now an off-heap structure doesn't have any instantiated {{Foo}} 
entries, else there is no GC benefit other than smaller instances to compact.
# Does {{FooManager}} instantiate new {{Foo}} instances every time 
{{FooManager.get(id)}} is called?  If yes, it generates a tremendous amount of 
garbage that defeats the GC benefit of going off heap.
# Does {{FooManager}} try to maintain a limited pool of mutable {{Foo}} objects 
for reuse (e.g. via a {{Foo#reinitialize(id, f1..f4)}})?  (I've tried this 
technique elsewhere with degraded performance, but maybe there's a good way to 
do it.)
# If no {{Foo}} entries are allowed:
## does {{FooManager}} have methods for every data member that used to be 
encapsulated by {{Foo}}?  I.e. {{FooManager.getField$N(id)}}?  We'll have to 
make N-many calls, probably within a critical section?
## Will APIs change from {{doSomething(Foo foo, String msg, boolean flag)}} to 
{{doSomething(Long fooId, int fooField1, long fooField2, boolean fooField3, 
long fooField4, String msg, boolean flag)}}?
## If we add another field, do we go back and update all the APIs again?




[jira] [Commented] (HDFS-6709) Implement off-heap data structures for NameNode and other HDFS memory optimization

2014-07-24 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073529#comment-14073529
 ] 

Colin Patrick McCabe commented on HDFS-6709:


bq. I thought RTTI is per class, not instance? If yes, the savings are 
immaterial?

RTTI has to be per-instance.  That is why you can pass around Object instances 
and cast them to whatever you want.  Java has to store this information 
somewhere (think about it).  If Java didn't store this, it would have no way to 
know whether the cast should succeed or not.  Then you would be in the same 
situation as in C, where you can cast something to something else and get 
random garbage bits.
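A tiny illustration of the point (a hedged sketch; any JVM will do): the casts 
below can only be checked at runtime by consulting the object's own header.

{code}
public class CastDemo {
  public static void main(String[] args) {
    Object o = "hello";        // the static type gives no hint of the dynamic type
    String s = (String) o;     // succeeds: the JVM consults o's per-instance class pointer
    Object n = Integer.valueOf(42);
    String t = (String) n;     // compiles, but throws ClassCastException at runtime
  }
}
{code}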

bq. Using misaligned access may result in processor incompatibility, impact 
performance, and introduce atomicity and CAS problems; concurrent access to 
adjacent misaligned memory in the same cache line may be completely unsafe.

I know about alignment restrictions.  There are easy ways around that 
problem: instead of getLong you use two getInt calls (or four getShort calls), 
etc., depending on the minimum alignment you can rely on.  I don't see how CAS 
or atomicity are relevant, since we're not discussing atomic data structures.  
The performance benefit of storing less data can often cancel out the 
performance disadvantage of doing unaligned access.  It depends on the 
scenario.
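As a sketch of that workaround (illustrative only, not code from the patch; 
the class name is invented): compose a wide read from narrower aligned reads 
when only 4-byte alignment is guaranteed.

{code}
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class UnalignedReads {
  // Read a long at an offset that is only guaranteed 4-byte aligned, using
  // two aligned int reads.  Assumes the buffer is LITTLE_ENDIAN.
  static long getLongVia4ByteReads(ByteBuffer buf, int offset) {
    long lo = buf.getInt(offset) & 0xFFFFFFFFL;
    long hi = buf.getInt(offset + 4) & 0xFFFFFFFFL;
    return (hi << 32) | lo;
  }

  public static void main(String[] args) {
    ByteBuffer buf = ByteBuffer.allocateDirect(16).order(ByteOrder.LITTLE_ENDIAN);
    buf.putLong(4, 0x1122334455667788L);   // deliberately 8-byte-misaligned
    System.out.println(Long.toHexString(getLongVia4ByteReads(buf, 4)));
  }
}
{code}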

bq. Only primitives, not references, can be stored off-heap, so how do value 
types (non-boxed primitives, correct?) apply? Wouldn't the instance managing 
the slab have methods that return the correct primitive?

The point is that with control over the layout, you can do better.  I guess a 
more concrete example might help explain this.

bq. OO encapsulation and polymorphism are lost?

Take a look at {{BlockInfo#triplets}}.  How much OO encapsulation do you see 
in an {{Object[]}} array with a special comment above it about how to interpret 
each set of three entries?  Most of the places where we'd like to use off-heap 
storage are already full of hacks that abuse the Java type system to squeeze in 
a few extra bytes.  Arrays of primitives and arrays of objects with special 
conventions are routine.

bq. Does FooManager instantiate new Foo instances every time FooManager.get(id) 
is called? If yes, it generates a tremendous amount of garbage that defeats the 
GC benefit of going off heap.

No, because every modern GC uses generational collection.  This means that 
short-lived instances are quickly cleaned up, without any pauses.

The rest of the questions seem to be variants on this one.  Think about it.  
All the code we have in FSNamesystem follows the pattern: look up inode, do 
something to the inode, done with the inode.  We can create temporary INode 
objects and they'll never make it to the tenured generation, since they don't 
stick around between RPC calls.  Even if they somehow did (how?), with a 
dramatically smaller heap the full GC would no longer be scary.  And we'd get 
other performance benefits, like the compressed-oops optimization.  Anyway, the 
temporary inode objects would probably just be thin objects which contain an 
off-heap memory reference and a bunch of getters/setters, to avoid doing a lot 
of unnecessary serde.
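A minimal sketch of such a thin object, with the field layout and names 
invented here for illustration (the real layout would come from the slab 
design):

{code}
import java.nio.ByteBuffer;

// Short-lived flyweight over an off-heap record: it holds only a slab
// reference and a base offset, and decodes fields on demand.  The wrapper
// itself dies young; the data lives off-heap.
final class INodeView {
  private static final int MTIME_OFF = 0, ATIME_OFF = 8;  // hypothetical layout
  private final ByteBuffer slab;
  private final int base;   // offset of this inode's record within the slab

  INodeView(ByteBuffer slab, int base) {
    this.slab = slab;
    this.base = base;
  }

  long getModificationTime()       { return slab.getLong(base + MTIME_OFF); }
  void setModificationTime(long t) { slab.putLong(base + MTIME_OFF, t); }
  long getAccessTime()             { return slab.getLong(base + ATIME_OFF); }
  void setAccessTime(long t)       { slab.putLong(base + ATIME_OFF, t); }
}
{code}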



[jira] [Commented] (HDFS-6709) Implement off-heap data structures for NameNode and other HDFS memory optimization

2014-07-24 Thread Andrew Purtell (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073677#comment-14073677
 ] 

Andrew Purtell commented on HDFS-6709:
--

bq. No, because every modern GC uses generational collection. This means that 
short-lived instances are quickly cleaned up, without any pauses.

... and modern JVM versions have escape analysis enabled by default. Although 
there are limitations, simple objects that don't escape the local block (like 
Iterators) or the method can be allocated on the stack once native code is 
emitted by the server compiler. No heap allocation happens at all. You can use 
fastdebug JVM builds during dev to learn explicitly what your code is doing in 
this regard.
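As a small, hedged illustration of the kind of allocation that qualifies 
(whether it is actually eliminated depends on the JVM, inlining, and the 
compiler in use):

{code}
final class EscapeDemo {
  static final class Point {
    final int x, y;
    Point(int x, int y) { this.x = x; this.y = y; }
  }

  // 'p' never leaves this method, so after inlining the server compiler can
  // scalar-replace it: the fields live in registers and no Point is ever
  // allocated on the heap.
  static int distSq(int x, int y) {
    Point p = new Point(x, y);
    return p.x * p.x + p.y * p.y;
  }
}
{code}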



[jira] [Commented] (HDFS-6709) Implement off-heap data structures for NameNode and other HDFS memory optimization

2014-07-24 Thread Kai Zheng (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074027#comment-14074027
 ] 

Kai Zheng commented on HDFS-6709:
-

bq.  Sadly, while investigating off heap performance last fall, I found this 
article that claims off-heap reads via a DirectByteBuffer have horrible 
performance
I just took a look at the post.  Yes, it claimed DirectByteBuffer has the same 
great write performance as Unsafe, but the read performance is horrible.  Why?  
The reason isn't clear yet.  Looking at the following code from the JRE, there 
seems to be no big difference between read and write in DirectByteBuffer:
{code}
public byte get() {
    return ((unsafe.getByte(ix(nextGetIndex()))));
}
{code}
{code}
public ByteBuffer put(byte x) {
    unsafe.putByte(ix(nextPutIndex()), ((x)));
    return this;
}
{code}
Questions here: 1) Why does read perform so much worse than write, if it's 
true?  2) Is it true that simply adding the index check would cause a big 
performance loss?
Some tests would be needed to make sure DirectByteBuffer is good enough to meet 
the needs here, even from a performance standpoint, and the comparison should 
be apples to apples for exactly the cases here.
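A rough harness for that apples-to-apples comparison might look like the 
following (class name invented; not JMH-quality, since there is no forking or 
statistical treatment, so treat single-pass numbers skeptically):

{code}
import java.nio.ByteBuffer;

public class DirectBufferReadWriteBench {
  static final int N = 1 << 20;

  public static void main(String[] args) {
    ByteBuffer buf = ByteBuffer.allocateDirect(N);
    long sink = 0;
    for (int i = 0; i < 5; i++) sink += writePass(buf) + readPass(buf);  // warm-up

    long t0 = System.nanoTime();
    sink += writePass(buf);
    long t1 = System.nanoTime();
    sink += readPass(buf);
    long t2 = System.nanoTime();

    System.out.println("write: " + (t1 - t0) + " ns, read: " + (t2 - t1)
        + " ns (sink=" + sink + ")");
  }

  static long writePass(ByteBuffer buf) {
    for (int i = 0; i < N; i++) buf.put(i, (byte) i);  // absolute put: bounds check + address math
    return buf.get(0);
  }

  static long readPass(ByteBuffer buf) {
    long acc = 0;
    for (int i = 0; i < N; i++) acc += buf.get(i);     // absolute get: same checks on the read path
    return acc;
  }
}
{code}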



[jira] [Commented] (HDFS-6709) Implement off-heap data structures for NameNode and other HDFS memory optimization

2014-07-23 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071964#comment-14071964
 ] 

Daryn Sharp commented on HDFS-6709:
---

If {{Unsafe}} is being removed then I don't think we should create a dependency 
on it.  Sadly, while investigating off heap performance last fall, I found this 
article that claims off-heap reads via a {{DirectByteBuffer}} have *horrible* 
performance:

http://www.javacodegeeks.com/2013/08/which-memory-is-faster-heap-or-bytebuffer-or-direct.html

bq. With a hash table and a linked list, we could probably start off-heaping 
things such as the triplets array in the BlockInfo object.

How do you envision off-heaping triplets in conjunction with those collections? 
 Linked list entries cost 48 bytes on a 64-bit JVM.  A hash table entry costs 
52 bytes.  I know your goal is reduced GC while ours is reduced memory usage, 
so it'll be unacceptable if an off-heap implementation consumes even more 
memory, which incidentally will require GC and may cancel any off-heap 
benefit, and/or cause a performance degradation.



[jira] [Commented] (HDFS-6709) Implement off-heap data structures for NameNode and other HDFS memory optimization

2014-07-23 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072197#comment-14072197
 ] 

Colin Patrick McCabe commented on HDFS-6709:


If Unsafe is removed, then we'll work around it the same way we work around 
the lack of symlink or hardlink support, missing error information from mkdir, 
etc.  As you can see in this patch, we don't need Unsafe; we just use it 
because it's faster.  I would assume that if Unsafe is removed, there will be 
work on improving DirectByteBuffer and JNI performance, or on putting other 
alternate APIs in place that allow Java to function effectively on the server.  
Otherwise, the future of the platform doesn't look good.  Even Haskell has an 
Unsafe package.

bq. How do you envision off-heaping triplets in conjunction with those 
collections? Linked list entries cost 48 bytes on a 64-bit JVM. A hash table 
entry costs 52 bytes. I know your goal is reduced GC while ours is reduced 
memory usage, so it'll be unacceptable if an off-heap implementation consumes 
even more memory, which incidentally will require GC and may cancel any 
off-heap benefit, and/or cause a performance degradation.

With off-heap objects, the sizes can be whatever we want.  I think a basic 
linked list entry would be 16 bytes (two 8-byte prev and next pointers), plus 
the size of the payload; see the sketch after this list.  A hash table entry 
has no real minimum size since, again, it's just a memory region that contains 
whatever we want.  We will be able to do a lot better than the JVM because of 
a few things:
* the JVM must store runtime type information (RTTI) for each object, and we 
won't
* the 64-bit JVM usually aligns to 8 bytes, but we don't have to
* we don't have to implement a lock bit, or any of that
* we can use value types, and current JVMs can't (although future ones will be 
able to)
* the JVM doesn't know that you will create 1 million instances of an object; 
it just creates a generic object layout that must balance access speed and 
object size.  Since we know, we can be more clever.
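Here is a minimal sketch (not from the attached patch; names invented) of that 
16-byte entry layout, with prev/next stored as 8-byte offsets into the slab 
instead of object references:

{code}
import java.nio.ByteBuffer;

// Doubly-linked list over one off-heap slab.  Entry layout:
//   [prev offset : 8][next offset : 8][payload ...]
// Offsets address the slab itself, so entries carry no JVM object header.
final class OffHeapLinkedList {
  private static final int PREV = 0, NEXT = 8, HEADER = 16;
  private final ByteBuffer slab = ByteBuffer.allocateDirect(1 << 20);

  long prev(int entry) { return slab.getLong(entry + PREV); }
  long next(int entry) { return slab.getLong(entry + NEXT); }

  void link(int a, int b) {        // splice a <-> b
    slab.putLong(a + NEXT, b);
    slab.putLong(b + PREV, a);
  }

  int payloadOffset(int entry) {   // payload starts right after the header
    return entry + HEADER;
  }
}
{code}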



[jira] [Commented] (HDFS-6709) Implement off-heap data structures for NameNode and other HDFS memory optimization

2014-07-22 Thread Kai Zheng (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14070021#comment-14070021
 ] 

Kai Zheng commented on HDFS-6709:
-

Hi Colin,

I went through the patch.  The Slab stuff looks great.  One question: looking 
at the following code, I doubt we need the path that calls unsafe.getByte().  
Whether allocated from a direct buffer or a heap buffer, the ByteBuffer 
interface can be used to get/set bytes from/to the buffer.  Note it would be 
good to avoid using Unsafe if possible.  Please clarify if I've misunderstood 
anything here, thanks.

{code}
+  byte getByte(int offset) {
+    if (base != 0) {
+      return NativeIO.getUnsafe().getByte(null, base + offset);
+    } else {
+      buf.position(offset);
+      return buf.get();
+    }
+  }
{code}
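For reference, a sketch of the Unsafe-free alternative being suggested (class 
name invented): an absolute {{ByteBuffer.get(int)}} works for both direct and 
heap buffers and avoids mutating the buffer's position.

{code}
import java.nio.ByteBuffer;

// Sketch only: absolute access needs no Unsafe and no position() churn,
// regardless of whether 'buf' is direct or heap-backed.
final class Slab {
  private final ByteBuffer buf = ByteBuffer.allocateDirect(1 << 20);

  byte getByte(int offset) {
    return buf.get(offset);   // absolute get; does not touch the position
  }
}
{code}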




[jira] [Commented] (HDFS-6709) Implement off-heap data structures for NameNode and other HDFS memory optimization

2014-07-22 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14070637#comment-14070637
 ] 

Colin Patrick McCabe commented on HDFS-6709:


The {{ByteBuffer}} interface should be slower, since it needs to update offsets 
and do boundary checking.  The {{ByteBuffer}} stuff is just a fallback if the 
JVM doesn't support direct buffers (although I'm not sure how many such JVMs 
are left in the wild).  It also might be helpful for debugging.



[jira] [Commented] (HDFS-6709) Implement off-heap data structures for NameNode and other HDFS memory optimization

2014-07-22 Thread Kai Zheng (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071222#comment-14071222
 ] 

Kai Zheng commented on HDFS-6709:
-

bq. The {{ByteBuffer}} interface should be slower, since it needs to update 
offsets and do boundary checking.
Agree. But it still meets the primary goal here, i.e. putting data off heap.

I'm wondering whether using Unsafe can be avoided here or not.  There were 
some discussions with Oracle related to this, and we were told that JDK 9 is 
likely to block all access to sun.* classes.  Therefore we might need to clean 
up such calls before then and avoid adding new ones like this one.  Regarding 
the Unsafe situation, you might look at the following doc (provided by Max 
from Oracle) and give your insights.  Thanks.
http://cr.openjdk.java.net/~psandoz/dv14-uk-paul-sandoz-unsafe-the-situation.pdf
