[jira] [Commented] (PIG-2632) Create a SchemaTuple which generates efficient Tuples via code gen

[email protected] (Commented) (JIRA) Thu, 05 Apr 2012 23:45:05 -0700

    [ 
https://issues.apache.org/jira/browse/PIG-2632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13248133#comment-13248133
 ]

[email protected] commented on PIG-2632:
----------------------------------------------------

bq.  On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq.  > 
trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java,
 line 133
bq.  > <https://reviews.apache.org/r/4651/diff/1/?file=100075#file100075line133>
bq.  >
bq.  >     it would be nice if Pig called outputSchema() once during the init 
and that was saved somewhere.

It calls it many times, alas. I feel like changing that is a job for another 
patch, but I could tackle that here. But given there is already the expectation 
that it will be called an arbitrary number of times, I don't think that calling 
it once more on the front end will be an issue?

bq.  On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq.  > 
trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java,
 lines 136-145
bq.  > <https://reviews.apache.org/r/4651/diff/1/?file=100075#file100075line136>
bq.  >
bq.  >     So the side effect is that the Tuple class get generated here ? This 
should be encapsulated in a more obvious method here.
bq.  >     like: generateTupleForSchema(tmpS) and not necessarily create an 
instance of a tuple.
bq.  >     also that would remove duplication here.

I was trying to leverage Dmitriy's newTupleForSchema, but I think you're right.

bq.  On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq.  > 
trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java,
 lines 151-152
bq.  > <https://reviews.apache.org/r/4651/diff/1/?file=100075#file100075line151>
bq.  >
bq.  >     those are the inputSchema and the outputSchema. The fields should 
probably called just that and have boolean flag to know if we use generated 
Tuples?

Well, we need to know what the tuple to generate is. I will change it save the 
ID that serves as the base, but as is the Schema info isn't saved and it isn't 
available.

bq.  On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq.  > 
trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java,
 line 188
bq.  > <https://reviews.apache.org/r/4651/diff/1/?file=100075#file100075line188>
bq.  >
bq.  >     I would make it:
bq.  >     TupleFactory.geInstanceForSchema().newTuple()

I think it's worth thinking about how we want to do this. Dmitriy implemented 
newTupleForSchema for the PrimitiveTuple work, and he suggested I use that for 
this... however I agree, I think there should be a "SchemaTupleFactory" that 
generates SchemaTuples of a given SchemaTuple. At the same time, that might be 
a lot of scaffolding for not a lot of gain: once you generate the code for a 
given SchemaTuple, there's not a lot of work to be done (though this could 
encapsulate some of the static maps that come later). One other factor is that 
the TupleFactory interface doesn't really lend itself very nicely to the 
SchemaTuple, but I do think it could be extended and be made to work. Would 
appreciate more of your thoughts on this.

bq.  On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq.  > trunk/src/org/apache/pig/data/BinInterSedes.java, lines 601-602
bq.  > <https://reviews.apache.org/r/4651/diff/1/?file=100076#file100076line601>
bq.  >
bq.  >     should SchemaTuple extend TypeAwareTuple ?

Agreed, and my working version has that update.

bq.  On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq.  > trunk/src/org/apache/pig/data/PrimitiveTuple.java, line 213
bq.  > <https://reviews.apache.org/r/4651/diff/1/?file=100077#file100077line213>
bq.  >
bq.  >     BytesKey ?

good call

bq.  On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq.  > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 46-48
bq.  > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line46>
bq.  >
bq.  >     statics on top

I really need to clean up where everything goes, noted.

bq.  On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq.  > trunk/src/org/apache/pig/data/SchemaTuple.java, line 70
bq.  > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line70>
bq.  >
bq.  >     largestSetValue should already be correct. Why does largestSetValue 
stays at the size of the schema?

I guess once it is at the value of the SchemaTuple (excluding appends), there's 
no need to update it.

bq.  On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq.  > trunk/src/org/apache/pig/data/SchemaTuple.java, line 77
bq.  > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line77>
bq.  >
bq.  >     isAppendNull()
bq.  >     Should the append...() methods be public?

I think a lot of these methods should be protected, agreed

bq.  On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq.  > trunk/src/org/apache/pig/data/SchemaTuple.java, line 81
bq.  > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line81>
bq.  >
bq.  >     the result would be the same, right?

yeah, you're right

bq.  On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq.  > trunk/src/org/apache/pig/data/SchemaTuple.java, line 94
bq.  > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line94>
bq.  >
bq.  >     setAppend()

yeah need to rename all of these. naming is what I am worst at :S

bq.  On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq.  > trunk/src/org/apache/pig/data/SchemaTuple.java, line 105
bq.  > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line105>
bq.  >
bq.  >     if it is always overridden, it should be abstract

It's overriden, but the overriden version calls super. Right now it returns 0 
because I don't want to take the time to calculate it out while the class is 
still evolving.

bq.  On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq.  > trunk/src/org/apache/pig/data/SchemaTuple.java, line 119
bq.  > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line119>
bq.  >
bq.  >     set? stream?

This originally meant "Schema Tuple Element Write" but yeah, it should be more 
descriptive.

bq.  On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq.  > trunk/src/org/apache/pig/data/SchemaTuple.java, line 125
bq.  > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line125>
bq.  >
bq.  >     redundant with the 2 others?
bq.  >     idClassMap.get(schemaIdMap.get(schema))

Perhaps this is what a SchemaTupleFactory would make cleaner, but my desire is 
for Tuple creation to be as fast as possible. This is why there are 5 maps... 
my thought is that they are all static, so the extra cost of caching the extra 
info is worth avoiding two gets in a row to link the info together when we need 
to make Tuples. Am open to thoughts.

bq.  On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq.  > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 130-142
bq.  > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line130>
bq.  >
bq.  >     could be computed statically by the generated class if you make a 
static getSchema(schemaString).

Well, my thought is that a static cache wouldn't take up much memory, and it 
would avoid having to recreate the Schema object every time getSchema() is 
called. I don't forsee it being called a ton, but possibly enough that 
recreating the object a bunch would be wasteful. Perhaps this is premature 
optimization?

bq.  On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq.  > trunk/src/org/apache/pig/data/SchemaTuple.java, line 144
bq.  > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line144>
bq.  >
bq.  >     setAll() ?

It's not really a setAll imho. It's setting it equal to. Maybe that's what 
setAll means, but it seems to convey something different.

bq.  On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq.  > trunk/src/org/apache/pig/data/SchemaTuple.java, line 159
bq.  > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line159>
bq.  >
bq.  >     do we need all of set(Tuple), set(List), setEqualTo(List), 
setEqualTo(Tuple) ?

My working version adds even more. I think a lot of it should be made 
protected, or at least, I should be more thoughtful about what it should be. I 
think "set" is probably enough, but perhaps it should just be the void version? 
I guess this is where working with ruby where everything generally returns 
"this" by default, thus allowing for nice chaining of methods, is the norm.

bq.  On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq.  > trunk/src/org/apache/pig/data/SchemaTuple.java, line 191
bq.  > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line191>
bq.  >
bq.  >     you could add a properties file to the jar with this info. Or a 
generated factory that knows all the classes generated. You mainly need the 
mapping schema -> id
bq.  >     
bq.  >     you could also use true. Although that would not be THE answer to 
THE question.

I think the jar properties is the way to go, at least as the ultimate source of 
the classes that were generated. Alternately, it'd be nice if there was an easy 
way to browse a jar for certain class files, but I couldn't find a clean one? 
If there was, we could just search for SchemaTuple_*.class and keep track of 
which files went across. Do you know if there is an easier way to search jars 
like that? I can dig around as well.

bq.  On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq.  > trunk/src/org/apache/pig/data/SchemaTuple.java, line 209
bq.  > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line209>
bq.  >
bq.  >     As we have schema aware Tuples, in the future we could add 
get(String fieldName). We may not want to strip the aliases to reuse the class.

Hmm, it seems pretty silly that two Schemas that are essentially "int,int,long" 
except with different names would necessitate a different class. More 
importantly, however, I ran into cases where schema names were different due to 
the way that names were generated by different pieces of code, and so the 
lookup was thwarted if you were strict about names. Hmm hmm hmm. Do you think 
the win from get(String) is worth the potential complication? It could be 
convenient...

bq.  On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq.  > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 270-273
bq.  > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line270>
bq.  >
bq.  >     Logger

woops, I usually mark these //remove and whack them before the patch goes out...

bq.  On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq.  > trunk/src/org/apache/pig/data/SchemaTuple.java, line 268
bq.  > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line268>
bq.  >
bq.  >     there's only one code path where updateMaps is true. It could be 
simplified a bit.

agreed

bq.  On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq.  > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 310-321
bq.  > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line310>
bq.  >
bq.  >     refactor DataByteArray to provide this?

So here is the thing about this. Currently, a ton of the useful methods in 
BinInterSedes (where this logic was taken from) is private. Also, they aren't 
static (though there's no real reason why they shouldn't be?) This snippet (and 
other logic like it) should probably be made into a protected static method of 
BinInterSedes. Thoughts? It could also make sense for classes to have a static 
method to serialize themselves, but I'd argue that the BinInterSedes approach 
is probably more consistent with how pig is laid out (though I have no idea why 
most of the methods there are private instead of protected static).

bq.  On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq.  > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 325-333
bq.  > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line325>
bq.  >
bq.  >     Same comment, the String serialization should be shared with the 
regular Tuple.
bq.  >     
bq.  >     Also we could keep the a UTF8 in the SchemaTuple to reduce memory 
usage and convert lazilyto String (possibly caching the result)

This logic is stolen directly from BinInterSedes (see comment above). Can you 
flesh out what you mean about UTF8? If we change it here we should probably 
change it there as well (and make that the canonical protected static source of 
all your string serialization needs).

bq.  On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq.  > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 344-353
bq.  > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line344>
bq.  >
bq.  >     same comment as before.
bq.  >     The same logic should be use with regular tuples.
bq.  >     The same logic should be shared across types.
bq.  >     
bq.  >     It could also use a variable size int format.
bq.  >     https://developers.google.com/protocol-buffers/docs/encoding#types

varints is a good idea. general question of where this logic should live still 
holds as well, as this logic is also directly ripped from BinInterSedes.

bq.  On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq.  > trunk/src/org/apache/pig/data/SchemaTuple.java, line 452
bq.  > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line452>
bq.  >
bq.  >     is (null, null, null) the same as null ?

In DefaultTuple, this is deprecated and not really used. Given that, I use this 
to basically mean "has any information been written to this." So in your case, 
both would be null. The more important question is to make sure I use it 
consistently throughout. Honestly, the whole null thing is a pain, and I need 
to really comb over things to make sure I incorporated it properly.

bq.  On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq.  > trunk/src/org/apache/pig/data/SchemaTuple.java, line 474
bq.  > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line474>
bq.  >
bq.  >     this.set(t) ?

reference is this weird Franken-method that isn't used anywhere in the 
codebase. I don't know that I want to implement it when the semantics of what 
it should do don't seem clear at all. Open to thoughts on this though. The 
original intent of it doesn't make much sense for a SchemaTuple since the 
underlying structures are primitives, not objects.

bq.  On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq.  > trunk/src/org/apache/pig/data/SchemaTuple.java, line 479
bq.  > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line479>
bq.  >
bq.  >     DefaultTuple replaces null with ""

Good idea

bq.  On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq.  > trunk/src/org/apache/pig/data/SchemaTuple.java, line 484
bq.  > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line484>
bq.  >
bq.  >     We could have a list implementation that just wraps a tuple.

Hmm, I think this is an intriguing idea, though it might get sills, because 
that means you'd have a list which wraps a tuple which wraps a list.

bq.  On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq.  > trunk/src/org/apache/pig/data/SchemaTuple.java, line 495
bq.  > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line495>
bq.  >
bq.  >     yep, sounds like it could be generated and fall back to this if the 
other is a regular tuple

My working copy almost has this finished. there are three comparison functions: 
compareTo(Object), compareTo(SchemaTuple), and compareToSpecific(SchemaTuple_). 
It will check (unless you tell it not to) if you were actually given a 
SchemaTuple_X (or just a schemaTuple) and do progressively faster comparisons. 
I also need to generate a RawComparator but I'd rather clean everything else up 
first (especially the layout of the API and where pieces live).

bq.  On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq.  > trunk/src/org/apache/pig/data/SchemaTuple.java, line 535
bq.  > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line535>
bq.  >
bq.  >     you may want to log compilation warnings/errors

Is there a need to log it if the runtime exception is going to be thrown 
anyway? Because those errors get written to stderr. But I suppose I could pipe 
them through the Logger, if that's what you mean?

bq.  On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq.  > trunk/src/org/apache/pig/data/SchemaTuple.java, line 545
bq.  > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line545>
bq.  >
bq.  >     Class.forName(className) uses the classLoader of the current class.

nice!

bq.  On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq.  > trunk/src/org/apache/pig/data/SchemaTuple.java, line 585
bq.  > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line585>
bq.  >
bq.  >     if it does not need to be public then yes

it depends heavily on how we end up forcing people to make these things.

bq.  On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq.  > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 609-635
bq.  > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line609>
bq.  >
bq.  >     the parameter t is thrown away ?
bq.  >     
bq.  >     they don't need to be public.

This is more for conciseness in the code that generates the SchemaTuple_ code. 
It allows me to make a bunch of if then elses smaller. I suppose there are 
other ways to do it that don't involve a method call... I go back and forth on 
that. But the reason is so later on I can do:
add("        setPos_X(SchemaTuple.unbox(val, pos_X));");
which means that the right one will be based on the type of the field being 
filled (ie saving me a bunch of lines, but perhaps that should be trumped by 
cleanliness of the output code/API).

bq.  On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq.  > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 637-659
bq.  > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line637>
bq.  >
bq.  >     depending on how those are used, java already does auto un-boxing. 
bq.  >     The following works so you may not need them:
bq.  >     long unbox(Long v) {
bq.  >        return v;
bq.  >     }
bq.  >     
bq.  >     they don't need to be public

Good point. I hate auto-boxing/un-boxing as a language feature, but if it can 
be utilized to clean things up it is worth it.

bq.  On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq.  > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 667-685
bq.  > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line667>
bq.  >
bq.  >     same comment, the following works:
bq.  >     Float box(float v) {
bq.  >       return v;
bq.  >     }
bq.  >     
bq.  >     they don't need to be public

Yeah same as above as well. I think I can actually also leverage auto 
boxing/unboxing deeper in the code to remove some of the need for these 
functions.

bq.  On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq.  > trunk/src/org/apache/pig/data/SchemaTuple.java, line 698
bq.  > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line698>
bq.  >
bq.  >     this file has 1500 lines :)
bq.  >     I think those classes have earned their own package.

Yeah, I mean, you're right... but the logic is EXTREMELY narrow, which is why I 
don't like the idea of splitting it out. Yeah, it makes the file larger, but it 
makes 0 sense outside of the context of what is going on here. Perhaps that 
should be rectified?

bq.  On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq.  > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 857-860
bq.  > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line857>
bq.  >
bq.  >     this could be a method in the parent class

do you mean in SchemaTuple, or in TypeInFunctionStringOut?

bq.  On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq.  > trunk/src/org/apache/pig/data/SchemaTuple.java, line 1138
bq.  > 
<https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line1138>
bq.  >
bq.  >     CheckIfNullString.masks
bq.  >     maybe they should be in a Masks class then, along with the masking 
code generation (" & (byte)+...)

I don't know how worthwhile it is to split all of this stuff out into classes. 
Maybe it is. A lot of stuff is used once or twice in disparate places, and I 
fear that splitting it out more and more will just make people jump through 50 
classes to understand what a line of code gen is doing. On the other hand, it 
would make fixing bugs much cleaner. What probably needs to be done is some 
sort of cleaner code generation framework... would appreciate more thoughts 
here.

bq.  On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq.  > trunk/src/org/apache/pig/data/SchemaTuple.java, line 1283
bq.  > 
<https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line1283>
bq.  >
bq.  >     This feels weird.
bq.  >     what about:
bq.  >     "SchemaTuple.unbox(("+getBoxedTypeName()+")appendGet(diff))"
bq.  >     
bq.  >     getBoxedTypeName() {
bq.  >      switch (type) {
bq.  >        case (DataType.INTEGER): return "Integer";
bq.  >     ...

I actually was able to replace it with:
add("    case ("+fieldNum+"): return "+fs.type+";");
sometimes I miss the forest for the trees.

bq.  On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq.  > trunk/src/org/apache/pig/data/SchemaTuple.java, line 1413
bq.  > 
<https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line1413>
bq.  >
bq.  >     use 
http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/StringBuilder.html for 
all those string concatenations.

yeahh this is a good idea. I doubt the code gen takes up much time at all but 
given the gen'd files are potentially large, it's probably good practice

bq.  On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq.  > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 1522-1524
bq.  > 
<https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line1522>
bq.  >
bq.  >     what does isObjNotTup mean? What is this case?

it was an obscure case that was obsolete. I removed it

- Jonathan

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4651/#review6722
-----------------------------------------------------------

On 2012-04-05 01:31:00, Jonathan Coveney wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/4651/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2012-04-05 01:31:00)
bq.  
bq.  
bq.  Review request for Julien Le Dem.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  This work builds on Dmitriy's PrimitiveTuple work. The idea is that, 
knowing the Schema on the frontend, we can code generate Tuples which can be 
used for fun and profit. In rudimentary tests, the memory efficiency is 2-4x 
better, and it's ~15% smaller serialized (heavily heavily depends on the data, 
though). Need to do get/set tests, but assuming that it's on par (or even 
faster) than Tuple, the memory gain is huge.
bq.  
bq.  Need to clean up the code and add tests.
bq.  
bq.  Right now, it generates a SchemaTuple for every inputSchema and 
outputSchema given to UDF's. The next step is to make a SchemaBag, where I 
think the serialization savings will be really huge.
bq.  
bq.  Needs tests and comments, but I want the code to settle a bit.
bq.  
bq.  
bq.  This addresses bug PIG-2632.
bq.      https://issues.apache.org/jira/browse/PIG-2632
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    trunk/src/org/apache/pig/data/TypeAwareTuple.java 1309628 
bq.    
trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java
 1309628 
bq.    trunk/src/org/apache/pig/data/BinInterSedes.java 1309628 
bq.    trunk/src/org/apache/pig/data/PrimitiveTuple.java 1309628 
bq.    trunk/src/org/apache/pig/data/SchemaTuple.java PRE-CREATION 
bq.    trunk/src/org/apache/pig/data/TupleFactory.java 1309628 
bq.    trunk/src/org/apache/pig/impl/PigContext.java 1309628 
bq.    
trunk/src/org/apache/pig/newplan/logical/expression/UserFuncExpression.java 
1309628 
bq.  
bq.  Diff: https://reviews.apache.org/r/4651/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Jonathan
bq.  
bq.

> Create a SchemaTuple which generates efficient Tuples via code gen
> ------------------------------------------------------------------
>
>                 Key: PIG-2632
>                 URL: https://issues.apache.org/jira/browse/PIG-2632
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Jonathan Coveney
>            Assignee: Jonathan Coveney
>             Fix For: 0.11
>
>         Attachments: PIG-2632-0.patch, PIG-2632-1.patch
>
>
> This work builds on Dmitriy's PrimitiveTuple work. The idea is that, knowing 
> the Schema on the frontend, we can code generate Tuples which can be used for 
> fun and profit. In rudimentary tests, the memory efficiency is 2-4x better, 
> and it's ~15% smaller serialized (heavily heavily depends on the data, 
> though). Need to do get/set tests, but assuming that it's on par (or even 
> faster) than Tuple, the memory gain is huge.
> Need to clean up the code and add tests.
> Right now, it generates a SchemaTuple for every inputSchema and outputSchema 
> given to UDF's. The next step is to make a SchemaBag, where I think the 
> serialization savings will be really huge.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (PIG-2632) Create a SchemaTuple which generates efficient Tuples via code gen

Reply via email to