Jonathan Packer created PIG-3429:
------------------------------------

             Summary: Reduce Pig memory footprint using specialized tuple 
classes (complementary to SchemaTuple)
                 Key: PIG-3429
                 URL: https://issues.apache.org/jira/browse/PIG-3429
             Project: Pig
          Issue Type: Improvement
          Components: data
    Affects Versions: 0.12
            Reporter: Jonathan Packer


Pig's default tuple implementation is very memory inefficient for small tuples, 
as the minimum size of an empty tuple is 96 bytes. This leads to bags being 
spilled more often than they need to. SchemaTuple addresses this, but is not 
fully integrated into the PhysicalPlan pipeline (and seems like it would be 
difficult to do so). Furthermore, it is likely that almost all UDFs do not use 
SchemaTuple.

This patch therefore provides some basic optimizations to reduce memory 
footprint of tuples by having BinSedesTupleFactory construct specialized tuple 
implementations in certain circumstances. This way, anything using 
BinSedesTupleFactory will reap the benefits, and since SchemaTuple uses a 
different factory, it will not be interfered with.

There is a long description below, because this patch might break stuff. I 
tried to think through possible implementation hazards which I will list.

The specialized tuple implementations are as follows:

EmptyTuple          // no fields, just an object header = 8 bytes
NullWrapperTuple    // wraps a single null field, 8 bytes
CountingTuple       // replaces (1L) as initial output of COUNT, 8 bytes

IntegerWrapperTuple // these all wrap a single primitive field
LongWrapperTuple    // object header + rounded primitive size = 16 bytes
FloatWrapperTuple
DoubleWrapperTuple

BinSedesTuple2      // these are pair/triples of fields with no ArrayList
BinSedesTuple3      // 16/24 bytes of overhead as opposed to 80 from ArrayList

The memory savings are greatest for the algebraic math functions COUNT, SUM, 
etc. For example, the size of an intermediate tuple for COUNT should go from 
112 bytes to 8 bytes. The size of an intermediate tuple from SUM should go from 
112 bytes to 16 bytes.

I haven't finished running the full unit-tests, but TestAlgebraicEval passes so 
I'm hopeful it will be manageable to debug.

The three concerns that I have are:
1) Since TupleFactory now sometimes outputs non-appendable tuples, the 
isFixedSize() method had to be removed. A file search didn't show it being used 
anywhere though. I think appending to tuples instead of finding out the 
requisite size ahead of time is bad practice as well (I changed POForeach to do 
the latter so it can take advantage of the special tuple impls).
2) Also since TupleFactory now has multiple tuple types, the tupleClass() 
method gets tricky. I made a superclass GenericBinSedesTuple that all the 
specialized classes inherit from, and it seems to work, but I'm not sure what 
the implications of this are. It breaks the inheritance tree of AbstractTuple 
<-- DefaultTuple <-- BinSedesTuple, so now "DefaultBinSedesTuple" inherits 
directly from GenericBinSedesTuple and DefaultTuple is left unused. In the 
patch, all the stuff for DefaultBinSedesTuple is just copied over from the old 
DefaultTuple.
3) I tried to be careful not to break BinInterSedesTupleRawComparator, but this 
will need verification.

Finally,
4) For my personal use cases, I'd like to make custom tuple implementations 
like SparseMatrixTuple or FeatureVectorTuple. Would people be opposed to making 
some "hooks" in BinInterSedes for user-defined tuple types? I was thinking 
there could be some config which maps these hooks (data type bytes) to 
user-defined classes and uses reflection to instantiate and read them. Not sure 
if that would be performant though.

Thanks for reading all that!

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to