[jira] [Commented] (PIG-2632) Create a SchemaTuple which generates efficient Tuples via code gen

jirapos...@reviews.apache.org (Commented) (JIRA) Fri, 06 Apr 2012 10:19:49 -0700

    [ 
https://issues.apache.org/jira/browse/PIG-2632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13248509#comment-13248509
 ]

jirapos...@reviews.apache.org commented on PIG-2632:
----------------------------------------------------

bq.  On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq.  > 
trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java,
 line 133
bq.  > <https://reviews.apache.org/r/4651/diff/1/?file=100075#file100075line133>
bq.  >
bq.  >     it would be nice if Pig called outputSchema() once during the init 
and that was saved somewhere.
bq.  
bq.  Jonathan Coveney wrote:
bq.      It calls it many times, alas. I feel like changing that is a job for 
another patch, but I could tackle that here. But given there is already the 
expectation that it will be called an arbitrary number of times, I don't think 
that calling it once more on the front end will be an issue?

I agree,let's not add to this already big patch. This should be dealt with in a 
separate patch.

bq.  On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq.  > 
trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java,
 lines 151-152
bq.  > <https://reviews.apache.org/r/4651/diff/1/?file=100075#file100075line151>
bq.  >
bq.  >     those are the inputSchema and the outputSchema. The fields should 
probably called just that and have boolean flag to know if we use generated 
Tuples?
bq.  
bq.  Jonathan Coveney wrote:
bq.      Well, we need to know what the tuple to generate is. I will change it 
save the ID that serves as the base, but as is the Schema info isn't saved and 
it isn't available.

I was more commenting on the naming. Those are the input and output schema 
independently of code generation. 

Knowing if you can generate or not could be saved with a boolean. 

Possibly you don't even need to save the information than can generate the 
tuples or not. It could fall back to the DefaultTuple.

(Anyway, this one is not very important)

bq.  On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq.  > 
trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java,
 line 188
bq.  > <https://reviews.apache.org/r/4651/diff/1/?file=100075#file100075line188>
bq.  >
bq.  >     I would make it:
bq.  >     TupleFactory.geInstanceForSchema().newTuple()
bq.  
bq.  Jonathan Coveney wrote:
bq.      I think it's worth thinking about how we want to do this. Dmitriy 
implemented newTupleForSchema for the PrimitiveTuple work, and he suggested I 
use that for this... however I agree, I think there should be a 
"SchemaTupleFactory" that generates SchemaTuples of a given SchemaTuple. At the 
same time, that might be a lot of scaffolding for not a lot of gain: once you 
generate the code for a given SchemaTuple, there's not a lot of work to be done 
(though this could encapsulate some of the static maps that come later). One 
other factor is that the TupleFactory interface doesn't really lend itself very 
nicely to the SchemaTuple, but I do think it could be extended and be made to 
work. Would appreciate more of your thoughts on this.

I meant:
TupleFactory.geInstanceForSchema(inputSchemaForGen).newTuple()

The change is relatively small, it is just moving this from one class to the 
other and getting rid of the Maps. PrimitiveTuple should follow the same logic.

Asking for the schema in the factory initialization will simplify the code. As 
you say, you won't need to constantly look up the schema. Also pig is about 
writing a lot of tuples of the same type. 

The block of code (see 2 comments above) that triggers the code generation 
would just get the factories instead of actually generating Tuples. This is 
caused by the api asking for the schema in the wrong place.

TupleFactory is an abstract class so it should be doable to add methods without 
breaking compatibility.

If needed, backard compatibility can be maintained by having 
newTupleForSchema(inputSchemaForGen) call 
TupleFactory.geInstanceForSchema(inputSchemaForGen).newTuple()

bq.  On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq.  > trunk/src/org/apache/pig/data/SchemaTuple.java, line 70
bq.  > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line70>
bq.  >
bq.  >     largestSetValue should already be correct. Why does largestSetValue 
stays at the size of the schema?
bq.  
bq.  Jonathan Coveney wrote:
bq.      I guess once it is at the value of the SchemaTuple (excluding 
appends), there's no need to update it.

Correct but know you have to check this every time you update. I was thinking 
it would simplify the code a little if it was just the actual size of the tuple 
(including past the original size).

bq.  On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq.  > trunk/src/org/apache/pig/data/SchemaTuple.java, line 105
bq.  > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line105>
bq.  >
bq.  >     if it is always overridden, it should be abstract
bq.  
bq.  Jonathan Coveney wrote:
bq.      It's overriden, but the overriden version calls super. Right now it 
returns 0 because I don't want to take the time to calculate it out while the 
class is still evolving.

Ok, let's make it abstract once we get to this.

bq.  On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq.  > trunk/src/org/apache/pig/data/SchemaTuple.java, line 125
bq.  > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line125>
bq.  >
bq.  >     redundant with the 2 others?
bq.  >     idClassMap.get(schemaIdMap.get(schema))
bq.  
bq.  Jonathan Coveney wrote:
bq.      Perhaps this is what a SchemaTupleFactory would make cleaner, but my 
desire is for Tuple creation to be as fast as possible. This is why there are 5 
maps... my thought is that they are all static, so the extra cost of caching 
the extra info is worth avoiding two gets in a row to link the info together 
when we need to make Tuples. Am open to thoughts.

I like the SchemaTupleFactory approach. It would alleviate the need for this.

bq.  On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq.  > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 130-142
bq.  > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line130>
bq.  >
bq.  >     could be computed statically by the generated class if you make a 
static getSchema(schemaString).
bq.  
bq.  Jonathan Coveney wrote:
bq.      Well, my thought is that a static cache wouldn't take up much memory, 
and it would avoid having to recreate the Schema object every time getSchema() 
is called. I don't forsee it being called a ton, but possibly enough that 
recreating the object a bunch would be wasteful. Perhaps this is premature 
optimization?

caching is fine.
I meant that you could have the Schema object "cached" in a static field of the 
generated class
instead of a runtime generation + cache

in the generated class:

static schemaString = "....";
static Schema schema = Utils.getSchemaFromString(schemaString);

public Schema getSchema() {
  return schema;
}

bq.  On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq.  > trunk/src/org/apache/pig/data/SchemaTuple.java, line 144
bq.  > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line144>
bq.  >
bq.  >     setAll() ?
bq.  
bq.  Jonathan Coveney wrote:
bq.      It's not really a setAll imho. It's setting it equal to. Maybe that's 
what setAll means, but it seems to convey something different.

It sounds like setEqualTo() is what set() generally means. when you set 
something to a value then you expect it to be equal to the value. What do you 
think?

bq.  On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq.  > trunk/src/org/apache/pig/data/SchemaTuple.java, line 159
bq.  > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line159>
bq.  >
bq.  >     do we need all of set(Tuple), set(List), setEqualTo(List), 
setEqualTo(Tuple) ?
bq.  
bq.  Jonathan Coveney wrote:
bq.      My working version adds even more. I think a lot of it should be made 
protected, or at least, I should be more thoughtful about what it should be. I 
think "set" is probably enough, but perhaps it should just be the void version? 
I guess this is where working with ruby where everything generally returns 
"this" by default, thus allowing for nice chaining of methods, is the norm.

I'm fine with the chaining mechanism.
We should try to avoid having both as it makes the code harder to read. You 
cant set and ignore the return value;

bq.  On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq.  > trunk/src/org/apache/pig/data/SchemaTuple.java, line 191
bq.  > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line191>
bq.  >
bq.  >     you could add a properties file to the jar with this info. Or a 
generated factory that knows all the classes generated. You mainly need the 
mapping schema -> id
bq.  >     
bq.  >     you could also use true. Although that would not be THE answer to 
THE question.
bq.  
bq.  Jonathan Coveney wrote:
bq.      I think the jar properties is the way to go, at least as the ultimate 
source of the classes that were generated. Alternately, it'd be nice if there 
was an easy way to browse a jar for certain class files, but I couldn't find a 
clean one? If there was, we could just search for SchemaTuple_*.class and keep 
track of which files went across. Do you know if there is an easier way to 
search jars like that? I can dig around as well.

tl;dr: As you know what classes you generate I would go for a mechanism where 
you provide the list of classes to load like a properties file in the jar.

I don't think there is a clean way of scanning a package out of the box. 
Usually dependency injection containers (like Spring) have a mechanism for 
scanning the classpath for classes with a given annotation.
example:
http://static.springsource.org/spring/docs/3.0.0.M3/spring-framework-reference/html/ch04s12.html
(I'm not saying we should add dependencies to Pig! Just FYI)

You could have a simple scanning mechanism using getResource():
http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/ClassLoader.html#getResource%28java.lang.String%29
this.getClass().getClassLoader().getResource("my/package/path")
will return a URL and you can scan yourself the location of the URL.
If the package is not in a jar and on the local file system, you will get a 
file:/... url that you can easily make a File of.
If the package is in a jar, then you will need to use a JarFile to scan the 
content of the package in the jar.
The issue here is that the classloading mechanism is extensible: The URL could 
be a HTTP url or some other that do not provide a listing option. There is no 
way to have a fully generic scanning mechanism if you don't know the name of 
the classes you want.

A search for "enumerate classes in a package" returns examples of people doing 
this.

bq.  On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq.  > trunk/src/org/apache/pig/data/SchemaTuple.java, line 209
bq.  > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line209>
bq.  >
bq.  >     As we have schema aware Tuples, in the future we could add 
get(String fieldName). We may not want to strip the aliases to reuse the class.
bq.  
bq.  Jonathan Coveney wrote:
bq.      Hmm, it seems pretty silly that two Schemas that are essentially 
"int,int,long" except with different names would necessitate a different class. 
More importantly, however, I ran into cases where schema names were different 
due to the way that names were generated by different pieces of code, and so 
the lookup was thwarted if you were strict about names. Hmm hmm hmm. Do you 
think the win from get(String) is worth the potential complication? It could be 
convenient...

I would need to look more into it. I would be interested in seeing the problems 
you see when not stripping the names.

This seems changeable later if needed, so let's keep it that way for now.

bq.  On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq.  > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 270-273
bq.  > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line270>
bq.  >
bq.  >     Logger
bq.  
bq.  Jonathan Coveney wrote:
bq.      woops, I usually mark these //remove and whack them before the patch 
goes out...

no worries. Usually it can be useful to keep debug log statements for future 
modification.

bq.  On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq.  > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 310-321
bq.  > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line310>
bq.  >
bq.  >     refactor DataByteArray to provide this?
bq.  
bq.  Jonathan Coveney wrote:
bq.      So here is the thing about this. Currently, a ton of the useful 
methods in BinInterSedes (where this logic was taken from) is private. Also, 
they aren't static (though there's no real reason why they shouldn't be?) This 
snippet (and other logic like it) should probably be made into a protected 
static method of BinInterSedes. Thoughts? It could also make sense for classes 
to have a static method to serialize themselves, but I'd argue that the 
BinInterSedes approach is probably more consistent with how pig is laid out 
(though I have no idea why most of the methods there are private instead of 
protected static).

private stuff can be safely moved around and refactored :) (from a 
compatibility point of view)
Let's think of a way to have the same code used in all cases. There used to be 
one type of tuple so what made sense before may need to changee.

- parent class for both ?
- TupleSerializer class ?
- static helpers?

bq.  On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq.  > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 325-333
bq.  > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line325>
bq.  >
bq.  >     Same comment, the String serialization should be shared with the 
regular Tuple.
bq.  >     
bq.  >     Also we could keep the a UTF8 in the SchemaTuple to reduce memory 
usage and convert lazilyto String (possibly caching the result)
bq.  
bq.  Jonathan Coveney wrote:
bq.      This logic is stolen directly from BinInterSedes (see comment above). 
Can you flesh out what you mean about UTF8? If we change it here we should 
probably change it there as well (and make that the canonical protected static 
source of all your string serialization needs).

Following your goal of reducing memory utilization and as the serialized format 
is UTF8, we could keep UTF8 as the memory representation instead of String.
Java Strings are UTF16 which is minimum 2 bytes per characters. If the data is 
mostly ascii, this is wasteful. 
To maintain backward compatibility the get(int) would lazily convert to String 
on demand, possibly caching the result in a second field to save processing, 
but this may be overkill.

bq.  On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq.  > trunk/src/org/apache/pig/data/SchemaTuple.java, line 452
bq.  > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line452>
bq.  >
bq.  >     is (null, null, null) the same as null ?
bq.  
bq.  Jonathan Coveney wrote:
bq.      In DefaultTuple, this is deprecated and not really used. Given that, I 
use this to basically mean "has any information been written to this." So in 
your case, both would be null. The more important question is to make sure I 
use it consistently throughout. Honestly, the whole null thing is a pain, and I 
need to really comb over things to make sure I incorporated it properly.

agreed, we should just make sure this is the same behavior as the DefaultTuple

bq.  On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq.  > trunk/src/org/apache/pig/data/SchemaTuple.java, line 474
bq.  > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line474>
bq.  >
bq.  >     this.set(t) ?
bq.  
bq.  Jonathan Coveney wrote:
bq.      reference is this weird Franken-method that isn't used anywhere in the 
codebase. I don't know that I want to implement it when the semantics of what 
it should do don't seem clear at all. Open to thoughts on this though. The 
original intent of it doesn't make much sense for a SchemaTuple since the 
underlying structures are primitives, not objects.

"the underlying structures are primitives, not objects"
for now, as Bags and Maps will come.
For backward compatibility, I think we should implement this as a set(). It 
seems this is intended as a cheap set, not sure if "modifying one modifies the 
others" behavior is expected. some UDFs could depend on this.

Also let's mark reference() as deprecated right now so that we can remove it 
later.

bq.  On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq.  > trunk/src/org/apache/pig/data/SchemaTuple.java, line 484
bq.  > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line484>
bq.  >
bq.  >     We could have a list implementation that just wraps a tuple.
bq.  
bq.  Jonathan Coveney wrote:
bq.      Hmm, I think this is an intriguing idea, though it might get sills, 
because that means you'd have a list which wraps a tuple which wraps a list.

this is just for saving a list allocation. A wrapper list would use less memory.

We can punt this for now.

bq.  On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq.  > trunk/src/org/apache/pig/data/SchemaTuple.java, line 495
bq.  > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line495>
bq.  >
bq.  >     yep, sounds like it could be generated and fall back to this if the 
other is a regular tuple
bq.  
bq.  Jonathan Coveney wrote:
bq.      My working copy almost has this finished. there are three comparison 
functions: compareTo(Object), compareTo(SchemaTuple), and 
compareToSpecific(SchemaTuple_). It will check (unless you tell it not to) if 
you were actually given a SchemaTuple_X (or just a schemaTuple) and do 
progressively faster comparisons. I also need to generate a RawComparator but 
I'd rather clean everything else up first (especially the layout of the API and 
where pieces live).

cool

bq.  On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq.  > trunk/src/org/apache/pig/data/SchemaTuple.java, line 535
bq.  > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line535>
bq.  >
bq.  >     you may want to log compilation warnings/errors
bq.  
bq.  Jonathan Coveney wrote:
bq.      Is there a need to log it if the runtime exception is going to be 
thrown anyway? Because those errors get written to stderr. But I suppose I 
could pipe them through the Logger, if that's what you mean?

Ok.
You can also get them programatically, but stderr is good too. Let's make sure 
we don't swamp the output of Pig with too many messages.

bq.  On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq.  > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 609-635
bq.  > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line609>
bq.  >
bq.  >     the parameter t is thrown away ?
bq.  >     
bq.  >     they don't need to be public.
bq.  
bq.  Jonathan Coveney wrote:
bq.      This is more for conciseness in the code that generates the 
SchemaTuple_ code. It allows me to make a bunch of if then elses smaller. I 
suppose there are other ways to do it that don't involve a method call... I go 
back and forth on that. But the reason is so later on I can do:
bq.      add("        setPos_X(SchemaTuple.unbox(val, pos_X));");
bq.      which means that the right one will be based on the type of the field 
being filled (ie saving me a bunch of lines, but perhaps that should be trumped 
by cleanliness of the output code/API).

Ah yeah, this comment is obsoleted by my other comment bellow. I should have 
removed it.
I saw how it simplifies the generation later.
See my proposal for this bellow.

bq.  On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq.  > trunk/src/org/apache/pig/data/SchemaTuple.java, line 698
bq.  > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line698>
bq.  >
bq.  >     this file has 1500 lines :)
bq.  >     I think those classes have earned their own package.
bq.  
bq.  Jonathan Coveney wrote:
bq.      Yeah, I mean, you're right... but the logic is EXTREMELY narrow, which 
is why I don't like the idea of splitting it out. Yeah, it makes the file 
larger, but it makes 0 sense outside of the context of what is going on here. 
Perhaps that should be rectified?

I don't feel strongly about it.
Maybe you want to separate the base class SchemaTuple (runtime) from the 
Generator ?

bq.  On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq.  > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 857-860
bq.  > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line857>
bq.  >
bq.  >     this could be a method in the parent class
bq.  
bq.  Jonathan Coveney wrote:
bq.      do you mean in SchemaTuple, or in TypeInFunctionStringOut?

Where it makes sense to you.
based on the way you removed everything that does not need to be generated, you 
could factor out the error handling and just generate "default: 
handleDefault(fieldNum)"

in SchemaTuple?

bq.  On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq.  > trunk/src/org/apache/pig/data/SchemaTuple.java, line 1138
bq.  > 
<https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line1138>
bq.  >
bq.  >     CheckIfNullString.masks
bq.  >     maybe they should be in a Masks class then, along with the masking 
code generation (" & (byte)+...)
bq.  
bq.  Jonathan Coveney wrote:
bq.      I don't know how worthwhile it is to split all of this stuff out into 
classes. Maybe it is. A lot of stuff is used once or twice in disparate places, 
and I fear that splitting it out more and more will just make people jump 
through 50 classes to understand what a line of code gen is doing. On the other 
hand, it would make fixing bugs much cleaner. What probably needs to be done is 
some sort of cleaner code generation framework... would appreciate more 
thoughts here.

My goal is to have the bit management thing implemented once so that it is 
easier to change/improve/bugfix.
agreed that we should keep things simple.
If we have a SchemaTupleClassGenerator  that contains bit management helpers 
that could simplify things.
Separating the runtime from the generation would reduce the amount of things 
you have to look at at the same time.

bq.  On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq.  > trunk/src/org/apache/pig/data/SchemaTuple.java, line 1283
bq.  > 
<https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line1283>
bq.  >
bq.  >     This feels weird.
bq.  >     what about:
bq.  >     "SchemaTuple.unbox(("+getBoxedTypeName()+")appendGet(diff))"
bq.  >     
bq.  >     getBoxedTypeName() {
bq.  >      switch (type) {
bq.  >        case (DataType.INTEGER): return "Integer";
bq.  >     ...
bq.  
bq.  Jonathan Coveney wrote:
bq.      I actually was able to replace it with:
bq.      add("    case ("+fieldNum+"): return "+fs.type+";");
bq.      sometimes I miss the forest for the trees.

cool

bq.  On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq.  > trunk/src/org/apache/pig/data/SchemaTuple.java, line 1413
bq.  > 
<https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line1413>
bq.  >
bq.  >     use 
http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/StringBuilder.html for 
all those string concatenations.
bq.  
bq.  Jonathan Coveney wrote:
bq.      yeahh this is a good idea. I doubt the code gen takes up much time at 
all but given the gen'd files are potentially large, it's probably good practice

a concatenation copies both strings, so concatenating large strings over and 
over can become slow.

Note that "a"+"B"+foo+bar+"c" gets replaced by the compiler by new 
StringBuilder().append("a").append("B").append(foo).append(bar).append("c").toString()
But you still concatenate those to other strings so will end up duplicating 
large strings.

bq.  On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq.  > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 1522-1524
bq.  > 
<https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line1522>
bq.  >
bq.  >     what does isObjNotTup mean? What is this case?
bq.  
bq.  Jonathan Coveney wrote:
bq.      it was an obscure case that was obsolete. I removed it

ok

- Julien

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4651/#review6722
-----------------------------------------------------------

On 2012-04-05 01:31:00, Jonathan Coveney wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/4651/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2012-04-05 01:31:00)
bq.  
bq.  
bq.  Review request for Julien Le Dem.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  This work builds on Dmitriy's PrimitiveTuple work. The idea is that, 
knowing the Schema on the frontend, we can code generate Tuples which can be 
used for fun and profit. In rudimentary tests, the memory efficiency is 2-4x 
better, and it's ~15% smaller serialized (heavily heavily depends on the data, 
though). Need to do get/set tests, but assuming that it's on par (or even 
faster) than Tuple, the memory gain is huge.
bq.  
bq.  Need to clean up the code and add tests.
bq.  
bq.  Right now, it generates a SchemaTuple for every inputSchema and 
outputSchema given to UDF's. The next step is to make a SchemaBag, where I 
think the serialization savings will be really huge.
bq.  
bq.  Needs tests and comments, but I want the code to settle a bit.
bq.  
bq.  
bq.  This addresses bug PIG-2632.
bq.      https://issues.apache.org/jira/browse/PIG-2632
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    trunk/src/org/apache/pig/data/TypeAwareTuple.java 1309628 
bq.    
trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java
 1309628 
bq.    trunk/src/org/apache/pig/data/BinInterSedes.java 1309628 
bq.    trunk/src/org/apache/pig/data/PrimitiveTuple.java 1309628 
bq.    trunk/src/org/apache/pig/data/SchemaTuple.java PRE-CREATION 
bq.    trunk/src/org/apache/pig/data/TupleFactory.java 1309628 
bq.    trunk/src/org/apache/pig/impl/PigContext.java 1309628 
bq.    
trunk/src/org/apache/pig/newplan/logical/expression/UserFuncExpression.java 
1309628 
bq.  
bq.  Diff: https://reviews.apache.org/r/4651/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Jonathan
bq.  
bq.

> Create a SchemaTuple which generates efficient Tuples via code gen
> ------------------------------------------------------------------
>
>                 Key: PIG-2632
>                 URL: https://issues.apache.org/jira/browse/PIG-2632
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Jonathan Coveney
>            Assignee: Jonathan Coveney
>             Fix For: 0.11
>
>         Attachments: PIG-2632-0.patch, PIG-2632-1.patch
>
>
> This work builds on Dmitriy's PrimitiveTuple work. The idea is that, knowing 
> the Schema on the frontend, we can code generate Tuples which can be used for 
> fun and profit. In rudimentary tests, the memory efficiency is 2-4x better, 
> and it's ~15% smaller serialized (heavily heavily depends on the data, 
> though). Need to do get/set tests, but assuming that it's on par (or even 
> faster) than Tuple, the memory gain is huge.
> Need to clean up the code and add tests.
> Right now, it generates a SchemaTuple for every inputSchema and outputSchema 
> given to UDF's. The next step is to make a SchemaBag, where I think the 
> serialization savings will be really huge.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (PIG-2632) Create a SchemaTuple which generates efficient Tuples via code gen

Reply via email to