[jira] [Commented] (PIG-2632) Create a SchemaTuple which generates efficient Tuples via code gen

[email protected] (Commented) (JIRA) Fri, 06 Apr 2012 12:51:53 -0700

    [ 
https://issues.apache.org/jira/browse/PIG-2632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13248654#comment-13248654
 ]

[email protected] commented on PIG-2632:
----------------------------------------------------

bq.  On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq.  > 
trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java,
 line 188
bq.  > <https://reviews.apache.org/r/4651/diff/1/?file=100075#file100075line188>
bq.  >
bq.  >     I would make it:
bq.  >     TupleFactory.geInstanceForSchema().newTuple()
bq.  
bq.  Jonathan Coveney wrote:
bq.      I think it's worth thinking about how we want to do this. Dmitriy 
implemented newTupleForSchema for the PrimitiveTuple work, and he suggested I 
use that for this... however I agree, I think there should be a 
"SchemaTupleFactory" that generates SchemaTuples of a given SchemaTuple. At the 
same time, that might be a lot of scaffolding for not a lot of gain: once you 
generate the code for a given SchemaTuple, there's not a lot of work to be done 
(though this could encapsulate some of the static maps that come later). One 
other factor is that the TupleFactory interface doesn't really lend itself very 
nicely to the SchemaTuple, but I do think it could be extended and be made to 
work. Would appreciate more of your thoughts on this.
bq.  
bq.  Julien Le Dem wrote:
bq.      I meant:
bq.      TupleFactory.geInstanceForSchema(inputSchemaForGen).newTuple()
bq.      
bq.      The change is relatively small, it is just moving this from one class 
to the other and getting rid of the Maps. PrimitiveTuple should follow the same 
logic.
bq.      
bq.      Asking for the schema in the factory initialization will simplify the 
code. As you say, you won't need to constantly look up the schema. Also pig is 
about writing a lot of tuples of the same type. 
bq.      
bq.      The block of code (see 2 comments above) that triggers the code 
generation would just get the factories instead of actually generating Tuples. 
This is caused by the api asking for the schema in the wrong place.
bq.      
bq.      TupleFactory is an abstract class so it should be doable to add 
methods without breaking compatibility.
bq.      
bq.      If needed, backard compatibility can be maintained by having 
newTupleForSchema(inputSchemaForGen) call 
TupleFactory.geInstanceForSchema(inputSchemaForGen).newTuple()
bq.

I'd like your thoughts fleshing out the TupleFactory interface for a 
SchemaTupleFactory (which imho at this point makes the most sense -- 
TupleFactory.getInstanceForSchema(inputSchema); returns a SchemaTupleFactory 
which extends TupleFactory).

newTuple();
easy

newTuple(int size);
meaningless, throw an error?

newTuple(List c);
uses set(List object), throws error otherwise?

newTuple(Object datum);
meaningless?

newTupleNoCopy(List list);
same as newTuple(list);

If that sounds reasonable, I like this approach

bq.  On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq.  > trunk/src/org/apache/pig/data/SchemaTuple.java, line 70
bq.  > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line70>
bq.  >
bq.  >     largestSetValue should already be correct. Why does largestSetValue 
stays at the size of the schema?
bq.  
bq.  Jonathan Coveney wrote:
bq.      I guess once it is at the value of the SchemaTuple (excluding 
appends), there's no need to update it.
bq.  
bq.  Julien Le Dem wrote:
bq.      Correct but know you have to check this every time you update. I was 
thinking it would simplify the code a little if it was just the actual size of 
the tuple (including past the original size).

Well, I guess that depends how we define the size of the tuple. To me, the size 
= number of fields (independently of whether they are set) + number of append 
fields. The reason we need this field, then, is to keep track of whether or not 
we've set one of the generated fields. We could redefine the size to be the 
largest non-null field, but that's going to make it slower for minimal gain. 
Thoughts?

bq.  On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq.  > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 130-142
bq.  > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line130>
bq.  >
bq.  >     could be computed statically by the generated class if you make a 
static getSchema(schemaString).
bq.  
bq.  Jonathan Coveney wrote:
bq.      Well, my thought is that a static cache wouldn't take up much memory, 
and it would avoid having to recreate the Schema object every time getSchema() 
is called. I don't forsee it being called a ton, but possibly enough that 
recreating the object a bunch would be wasteful. Perhaps this is premature 
optimization?
bq.  
bq.  Julien Le Dem wrote:
bq.      caching is fine.
bq.      I meant that you could have the Schema object "cached" in a static 
field of the generated class
bq.      instead of a runtime generation + cache
bq.      
bq.      in the generated class:
bq.      
bq.      static schemaString = "....";
bq.      static Schema schema = Utils.getSchemaFromString(schemaString);
bq.      
bq.      public Schema getSchema() {
bq.        return schema;
bq.      }
bq.      
bq.

Ah, very good call

bq.  On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq.  > trunk/src/org/apache/pig/data/SchemaTuple.java, line 159
bq.  > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line159>
bq.  >
bq.  >     do we need all of set(Tuple), set(List), setEqualTo(List), 
setEqualTo(Tuple) ?
bq.  
bq.  Jonathan Coveney wrote:
bq.      My working version adds even more. I think a lot of it should be made 
protected, or at least, I should be more thoughtful about what it should be. I 
think "set" is probably enough, but perhaps it should just be the void version? 
I guess this is where working with ruby where everything generally returns 
"this" by default, thus allowing for nice chaining of methods, is the norm.
bq.  
bq.  Julien Le Dem wrote:
bq.      I'm fine with the chaining mechanism.
bq.      We should try to avoid having both as it makes the code harder to 
read. You cant set and ignore the return value;

good call

bq.  On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq.  > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 310-321
bq.  > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line310>
bq.  >
bq.  >     refactor DataByteArray to provide this?
bq.  
bq.  Jonathan Coveney wrote:
bq.      So here is the thing about this. Currently, a ton of the useful 
methods in BinInterSedes (where this logic was taken from) is private. Also, 
they aren't static (though there's no real reason why they shouldn't be?) This 
snippet (and other logic like it) should probably be made into a protected 
static method of BinInterSedes. Thoughts? It could also make sense for classes 
to have a static method to serialize themselves, but I'd argue that the 
BinInterSedes approach is probably more consistent with how pig is laid out 
(though I have no idea why most of the methods there are private instead of 
protected static).
bq.  
bq.  Julien Le Dem wrote:
bq.      private stuff can be safely moved around and refactored :) (from a 
compatibility point of view)
bq.      Let's think of a way to have the same code used in all cases. There 
used to be one type of tuple so what made sense before may need to changee.
bq.      
bq.      - parent class for both ?
bq.      - TupleSerializer class ?
bq.      - static helpers?

per your suggestion and dmitriy's go ahead on the list serv, I'm moving a bit 
of the shared logic to a SedesHelper class which both BinInterSedes and 
SchemaTuple will use

bq.  On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq.  > trunk/src/org/apache/pig/data/SchemaTuple.java, lines 325-333
bq.  > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line325>
bq.  >
bq.  >     Same comment, the String serialization should be shared with the 
regular Tuple.
bq.  >     
bq.  >     Also we could keep the a UTF8 in the SchemaTuple to reduce memory 
usage and convert lazilyto String (possibly caching the result)
bq.  
bq.  Jonathan Coveney wrote:
bq.      This logic is stolen directly from BinInterSedes (see comment above). 
Can you flesh out what you mean about UTF8? If we change it here we should 
probably change it there as well (and make that the canonical protected static 
source of all your string serialization needs).
bq.  
bq.  Julien Le Dem wrote:
bq.      Following your goal of reducing memory utilization and as the 
serialized format is UTF8, we could keep UTF8 as the memory representation 
instead of String.
bq.      Java Strings are UTF16 which is minimum 2 bytes per characters. If the 
data is mostly ascii, this is wasteful. 
bq.      To maintain backward compatibility the get(int) would lazily convert 
to String on demand, possibly caching the result in a second field to save 
processing, but this may be overkill.

What is the latency on utf8 -> utf16 conversion? That's all I'd be worried 
about but your suggestion is a good one.

bq.  On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq.  > trunk/src/org/apache/pig/data/SchemaTuple.java, line 452
bq.  > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line452>
bq.  >
bq.  >     is (null, null, null) the same as null ?
bq.  
bq.  Jonathan Coveney wrote:
bq.      In DefaultTuple, this is deprecated and not really used. Given that, I 
use this to basically mean "has any information been written to this." So in 
your case, both would be null. The more important question is to make sure I 
use it consistently throughout. Honestly, the whole null thing is a pain, and I 
need to really comb over things to make sure I incorporated it properly.
bq.  
bq.  Julien Le Dem wrote:
bq.      agreed, we should just make sure this is the same behavior as the 
DefaultTuple

DefaultTuple doesn't implement isNull(). It's just a dummy null implementation. 
But the difference is that I have fields that can be "null" but aren't actually 
null (ie primitives), so it's a bit more useful.

bq.  On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq.  > trunk/src/org/apache/pig/data/SchemaTuple.java, line 474
bq.  > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line474>
bq.  >
bq.  >     this.set(t) ?
bq.  
bq.  Jonathan Coveney wrote:
bq.      reference is this weird Franken-method that isn't used anywhere in the 
codebase. I don't know that I want to implement it when the semantics of what 
it should do don't seem clear at all. Open to thoughts on this though. The 
original intent of it doesn't make much sense for a SchemaTuple since the 
underlying structures are primitives, not objects.
bq.  
bq.  Julien Le Dem wrote:
bq.      "the underlying structures are primitives, not objects"
bq.      for now, as Bags and Maps will come.
bq.      For backward compatibility, I think we should implement this as a 
set(). It seems this is intended as a cheap set, not sure if "modifying one 
modifies the others" behavior is expected. some UDFs could depend on this.
bq.      
bq.      Also let's mark reference() as deprecated right now so that we can 
remove it later.

 I have no problem making this a set. The nice part about doing that is it 
means you can essentially get the functionality of a set without having to cast 
it. The downside being the weird semantics around it. Probably a net win. Agree 
on deprecating reference

bq.  On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq.  > trunk/src/org/apache/pig/data/SchemaTuple.java, line 698
bq.  > <https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line698>
bq.  >
bq.  >     this file has 1500 lines :)
bq.  >     I think those classes have earned their own package.
bq.  
bq.  Jonathan Coveney wrote:
bq.      Yeah, I mean, you're right... but the logic is EXTREMELY narrow, which 
is why I don't like the idea of splitting it out. Yeah, it makes the file 
larger, but it makes 0 sense outside of the context of what is going on here. 
Perhaps that should be rectified?
bq.  
bq.  Julien Le Dem wrote:
bq.      I don't feel strongly about it.
bq.      Maybe you want to separate the base class SchemaTuple (runtime) from 
the Generator ?

Agreed, very good suggestion

bq.  On 2012-04-06 03:30:08, Julien Le Dem wrote:
bq.  > trunk/src/org/apache/pig/data/SchemaTuple.java, line 1138
bq.  > 
<https://reviews.apache.org/r/4651/diff/1/?file=100078#file100078line1138>
bq.  >
bq.  >     CheckIfNullString.masks
bq.  >     maybe they should be in a Masks class then, along with the masking 
code generation (" & (byte)+...)
bq.  
bq.  Jonathan Coveney wrote:
bq.      I don't know how worthwhile it is to split all of this stuff out into 
classes. Maybe it is. A lot of stuff is used once or twice in disparate places, 
and I fear that splitting it out more and more will just make people jump 
through 50 classes to understand what a line of code gen is doing. On the other 
hand, it would make fixing bugs much cleaner. What probably needs to be done is 
some sort of cleaner code generation framework... would appreciate more 
thoughts here.
bq.  
bq.  Julien Le Dem wrote:
bq.      My goal is to have the bit management thing implemented once so that 
it is easier to change/improve/bugfix.
bq.      agreed that we should keep things simple.
bq.      If we have a SchemaTupleClassGenerator  that contains bit management 
helpers that could simplify things.
bq.      Separating the runtime from the generation would reduce the amount of 
things you have to look at at the same time.

+1

- Jonathan

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4651/#review6722
-----------------------------------------------------------

On 2012-04-06 17:20:20, Jonathan Coveney wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/4651/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2012-04-06 17:20:20)
bq.  
bq.  
bq.  Review request for pig and Julien Le Dem.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  This work builds on Dmitriy's PrimitiveTuple work. The idea is that, 
knowing the Schema on the frontend, we can code generate Tuples which can be 
used for fun and profit. In rudimentary tests, the memory efficiency is 2-4x 
better, and it's ~15% smaller serialized (heavily heavily depends on the data, 
though). Need to do get/set tests, but assuming that it's on par (or even 
faster) than Tuple, the memory gain is huge.
bq.  
bq.  Need to clean up the code and add tests.
bq.  
bq.  Right now, it generates a SchemaTuple for every inputSchema and 
outputSchema given to UDF's. The next step is to make a SchemaBag, where I 
think the serialization savings will be really huge.
bq.  
bq.  Needs tests and comments, but I want the code to settle a bit.
bq.  
bq.  
bq.  This addresses bug PIG-2632.
bq.      https://issues.apache.org/jira/browse/PIG-2632
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    trunk/src/org/apache/pig/data/TypeAwareTuple.java 1309628 
bq.    
trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java
 1309628 
bq.    trunk/src/org/apache/pig/data/BinInterSedes.java 1309628 
bq.    trunk/src/org/apache/pig/data/PrimitiveTuple.java 1309628 
bq.    trunk/src/org/apache/pig/data/SchemaTuple.java PRE-CREATION 
bq.    trunk/src/org/apache/pig/data/TupleFactory.java 1309628 
bq.    trunk/src/org/apache/pig/impl/PigContext.java 1309628 
bq.    
trunk/src/org/apache/pig/newplan/logical/expression/UserFuncExpression.java 
1309628 
bq.  
bq.  Diff: https://reviews.apache.org/r/4651/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Jonathan
bq.  
bq.

> Create a SchemaTuple which generates efficient Tuples via code gen
> ------------------------------------------------------------------
>
>                 Key: PIG-2632
>                 URL: https://issues.apache.org/jira/browse/PIG-2632
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Jonathan Coveney
>            Assignee: Jonathan Coveney
>             Fix For: 0.11
>
>         Attachments: PIG-2632-0.patch, PIG-2632-1.patch
>
>
> This work builds on Dmitriy's PrimitiveTuple work. The idea is that, knowing 
> the Schema on the frontend, we can code generate Tuples which can be used for 
> fun and profit. In rudimentary tests, the memory efficiency is 2-4x better, 
> and it's ~15% smaller serialized (heavily heavily depends on the data, 
> though). Need to do get/set tests, but assuming that it's on par (or even 
> faster) than Tuple, the memory gain is huge.
> Need to clean up the code and add tests.
> Right now, it generates a SchemaTuple for every inputSchema and outputSchema 
> given to UDF's. The next step is to make a SchemaBag, where I think the 
> serialization savings will be really huge.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (PIG-2632) Create a SchemaTuple which generates efficient Tuples via code gen

Reply via email to