Alan Gates wrote:
On May 19, 2009, at 10:30 PM, Mridul Muralidharan wrote:
I am still not very convinced of the value of this
implementation - particularly considering the advances made since 1.3
in memory allocators and garbage collection.
My fundamental concern is not with the slowness of garbage collection.
I am asserting (along with the paper) that garbage collection is not an
optimal choice for a large data processing system. I don't want to
improve the garbage collector, I want to manage a subset of the memory
without it.
I should probably have elaborated better.
Most objects in Pig are in the young generation (please correct me if I
am wrong) - so promoting them from there (where allocation and collection
are handled pretty optimally and blazingly fast by the VM) into slower,
longer-lived memory pools should be done with some thought (management of
buffers, etc.).
The only (corner) cases where this is not valid, off the top of my head,
are when a single tuple becomes really large, usually due to a bag with
either a large number of tuples in it or tuples with large payloads:
and IMO those cases incur quite similar costs under this proposal too -
but I could be wrong.
The side effects of this proposal are many, and sometimes non-obvious:
implicitly moving young-generation data into the older generation,
putting much more pressure on the GC; fragmentation of memory blocks,
causing quite a bit of memory pressure; replicating quite a bit of the
garbage collector's functionality; the possibility of bugs in the
reference counting; etc.
I don't understand your concerns regarding the load on the gc and memory
fragmentation. Let's say I have 10,000 tuples, each with 10 fields.
Let's also assume that these tuples live long enough to make it into the
"old" memory pool, since this is the interesting case where objects live
long enough to cause a problem. In the current implementation there
will be 110,000 objects that the gc has to manage moving into the old
pool, and check every time it cleans the old pool. In the proposed
implementation there would be 10,001 objects (assuming all the data fit
into one buffer) to manage. And rather than allocating 100,000 small
pieces of memory, we would have allocated one large segment. My belief
is that this would lighten the load on the gc.
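The object-count arithmetic above can be sketched in code. The class names below (BoxedTuple, BufferBackedTuple) are hypothetical illustrations, not Pig's actual tuple API, and the fixed-width int layout is a simplifying assumption:

```java
import java.nio.ByteBuffer;

public class BufferSketch {
    // Current-style tuple: each of the 10 fields is a separate heap
    // object, so 10,000 tuples => ~110,000 objects for the GC to track.
    static class BoxedTuple {
        final Object[] fields = new Object[10];
    }

    // Proposed-style tuple: field data lives in one shared buffer; each
    // tuple keeps only an offset into it, so the GC sees 10,001 objects
    // (10,000 small tuple handles plus the one buffer).
    static class BufferBackedTuple {
        final ByteBuffer data;   // shared by all tuples
        final int offset;        // start of this tuple's fields
        BufferBackedTuple(ByteBuffer data, int offset) {
            this.data = data;
            this.offset = offset;
        }
        int getInt(int field) {  // fixed-width int fields for simplicity
            return data.getInt(offset + field * 4);
        }
    }

    public static void main(String[] args) {
        // One large allocation instead of 100,000 small ones.
        ByteBuffer shared = ByteBuffer.allocate(10_000 * 10 * 4);
        for (int t = 0; t < 10_000; t++)
            for (int f = 0; f < 10; f++)
                shared.putInt(t * 40 + f * 4, t + f);
        BufferBackedTuple first = new BufferBackedTuple(shared, 0);
        System.out.println(first.getInt(3)); // prints 3
    }
}
```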
Old-generation memory management is not trivial.
For example - and this should probably be commonly known by now - if an
old block is freed and yet the cost of moving existing blocks around to
use the 'free' block is high, the VM just leaves it around. Over time,
you end up with fragmentation in the old generation which can't be freed.
(This is not a VM bug - the costs outweigh the benefits.)
That being said, as I mentioned above, the cost of memory usage is not
linear - the young generation is far faster (allocation, management,
freeing) than the pools objects are successively promoted into
[compaction, reference updates, etc. in the GC].
In Pig's case, since it is essentially streaming in nature, most
tuples/bags - except in corner cases - would fall into the young
generation, where things are faster.
Just a note though -
The last time I had to dabble in memory management for my server needs,
it was already pretty complex and unintuitive (not to mention
environment- and implementation-specific) - and that was a few years
back. Unfortunately, I have not kept abreast of recent changes (and
quite a few have gone into the VM for Java 6, I was told), so my
comments above might no longer be valid.
Other than saying you would probably want to test extensively like we
had to, and that things are not as simple as they normally appear
[and IMO almost all books/articles get it wrong - so testing is the only
way out], I can't really comment more authoritatively anymore :-) Any
improvement to Pig's memory management would be a welcome change though!
Regards,
Mridul
This does replicate some of the functionality of the garbage
collector. Complex systems frequently need to re-implement foundational
functionality in order to optimize it for their needs. Hence many RDBMS
engines have their own implementations of memory management, file I/O,
thread scheduling, etc.
As for bugs in ref counting, I agree that forgetting to deallocate is
one of the most pernicious problems of allowing programmers to do memory
management. But in this case all that will happen is that a buffer will
get left around that isn't needed. If the system needs more memory then
that buffer will eventually get selected for flushing to disk, and then
it will stay there as no one will call it back into memory. So the cost
of forgetting to deallocate is minor.
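As a hedged illustration of why a missed deallocation is cheap in this scheme - the names and methods below are assumptions for the sketch, not the proposal's actual design:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch of a reference-counted buffer. A buffer whose
// owner forgot to call release() is merely "left around": under memory
// pressure the buffer manager can still spill it to disk, and since no
// one holds a live reference, it is never paged back in. The cost of
// the leak is disk space, not correctness.
public class RefCountedBuffer {
    final byte[] data;
    private final AtomicInteger refs = new AtomicInteger(1);

    RefCountedBuffer(int size) { data = new byte[size]; }

    void retain()  { refs.incrementAndGet(); }
    void release() { refs.decrementAndGet(); }

    // Reclaimable right away once every holder has called release().
    boolean unreferenced() { return refs.get() <= 0; }
}
```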
If the assumption is that the current working set of bags/tuples does
not need to be spilled, and anything else can be, then this will pretty
much degenerate to the current implementation in the worst case.
That is not the assumption. There are two issues: 1) trying to spill
bags only when we determine we need to is highly error prone, because we
can't accurately determine when we need to and because we sometimes
can't dump fast enough to survive; 2) current memory usage is far too
high, and needs to be reduced.
A much simpler method to gain benefits would be to handle primitives
as ... primitives, and not through the Java wrapper classes for them.
It should be possible to write schema-aware tuples which make use of
the declared primitives to take a fraction of the memory currently
required (4 bytes + a null-check boolean for an int, plus an offset
mapping, instead of the 24/32 bytes it currently takes, etc.).
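A minimal sketch of such a schema-aware tuple, assuming all-int fields for simplicity; the class and method names are illustrative only, not Pig's tuple interface:

```java
// Stores declared int fields as primitives plus a per-field null flag,
// instead of boxed Integer objects (~16-24 bytes each on typical JVMs).
public class PrimitiveIntTuple {
    private final int[] values;      // 4 bytes per field
    private final boolean[] isNull;  // the null-check flag per field

    PrimitiveIntTuple(int arity) {
        values = new int[arity];
        isNull = new boolean[arity];
        java.util.Arrays.fill(isNull, true);  // fields start out null
    }

    void set(int field, int v) { values[field] = v; isNull[field] = false; }
    void setNull(int field)    { isNull[field] = true; }

    Integer get(int field) {   // box only at the access boundary
        return isNull[field] ? null : values[field];
    }
}
```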
In my observation, at least 50% of the data in Pig is untyped, which
means it's a byte array. Of the 50% that people declare or the program
determines, probably 50-80% are chararrays and maps. So somewhere
under 25% of the data is numeric. Shrinking that 25% by 75% will be
nice - at most a ~19% reduction in total memory - but not adequate.
And it does nothing to help with the issue of being able to spill in a
controlled way instead of only in emergency situations.
Alan.