[jira] [Commented] (PIG-2632) Create a SchemaTuple which generates efficient Tuples via code gen

jirapos...@reviews.apache.org (Commented) (JIRA) Thu, 05 Apr 2012 20:31:48 -0700

    [ 
https://issues.apache.org/jira/browse/PIG-2632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13247999#comment-13247999
 ]


jirapos...@reviews.apache.org commented on PIG-2632:
----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4651/#review6722
-----------------------------------------------------------


This is an incredible piece of work! 
See my comments bellow


trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java
<https://reviews.apache.org/r/4651/#comment14556>

    it would be nice if Pig called outputSchema() once during the init and that 
was saved somewhere.



trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java
<https://reviews.apache.org/r/4651/#comment14557>

    So the side effect is that the Tuple class get generated here ? This should 
be encapsulated in a more obvious method here.
    like: generateTupleForSchema(tmpS) and not necessarily create an instance 
of a tuple.
    also that would remove duplication here.



trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java
<https://reviews.apache.org/r/4651/#comment14558>

    those are the inputSchema and the outputSchema. The fields should probably 
called just that and have boolean flag to know if we use generated Tuples?



trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java
<https://reviews.apache.org/r/4651/#comment14559>

    I would make it:
    TupleFactory.geInstanceForSchema().newTuple()



trunk/src/org/apache/pig/data/BinInterSedes.java
<https://reviews.apache.org/r/4651/#comment14560>

    should SchemaTuple extend TypeAwareTuple ?



trunk/src/org/apache/pig/data/PrimitiveTuple.java
<https://reviews.apache.org/r/4651/#comment14564>

    BytesKey ?



trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14573>

    statics on top



trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14574>

    largestSetValue should already be correct. Why does largestSetValue stays 
at the size of the schema?



trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14575>

    isAppendNull()
    Should the append...() methods be public?



trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14576>

    the result would be the same, right?



trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14580>

    setAppend()



trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14578>

    http://www.ibm.com/developerworks/java/library/j-codetoheap/index.html



trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14631>

    if it is always overridden, it should be abstract



trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14579>

    getAppendType()



trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14584>

    set? stream?



trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14583>

    getSizeNoAppend() ?



trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14585>

    redundant with the 2 others?
    idClassMap.get(schemaIdMap.get(schema))



trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14586>

    could be computed statically by the generated class if you make a static 
getSchema(schemaString).



trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14587>

    setAll() ?



trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14588>

    setAll() ?



trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14589>

    do we need all of set(Tuple), set(List), setEqualTo(List), 
setEqualTo(Tuple) ?



trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14591>

    you could add a properties file to the jar with this info. Or a generated 
factory that knows all the classes generated. You mainly need the mapping 
schema -> id
    
    you could also use true. Although that would not be THE answer to THE 
question.



trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14592>

    As we have schema aware Tuples, in the future we could add get(String 
fieldName). We may not want to strip the aliases to reuse the class.



trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14593>

    looking forward to support for those :)



trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14596>

    there's only one code path where updateMaps is true. It could be simplified 
a bit.



trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14595>

    Logger



trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14597>

    they should not be public



trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14598>

    refactor DataByteArray to provide this?



trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14599>

    Same comment, the String serialization should be shared with the regular 
Tuple.
    
    Also we could keep the a UTF8 in the SchemaTuple to reduce memory usage and 
convert lazilyto String (possibly caching the result)



trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14600>

    same comment as before.
    The same logic should be use with regular tuples.
    The same logic should be shared across types.
    
    It could also use a variable size int format.
    https://developers.google.com/protocol-buffers/docs/encoding#types



trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14601>

    is (null, null, null) the same as null ?



trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14609>

    this.set(t) ?



trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14610>

    DefaultTuple replaces null with ""



trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14611>

    We could have a list implementation that just wraps a tuple.



trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14612>

    yep, sounds like it could be generated and fall back to this if the other 
is a regular tuple



trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14614>

    you may want to log compilation warnings/errors



trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14613>

    Class.forName(className) uses the classLoader of the current class.



trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14615>

    if it does not need to be public then yes



trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14616>

    same comment as before about stripped aliases



trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14617>

    the parameter t is thrown away ?
    
    they don't need to be public.



trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14618>

    depending on how those are used, java already does auto un-boxing. 
    The following works so you may not need them:
    long unbox(Long v) {
       return v;
    }
    
    they don't need to be public



trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14620>

    same comment, the following works:
    Float box(float v) {
      return v;
    }
    
    they don't need to be public



trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14622>

    this file has 1500 lines :)
    I think those classes have earned their own package.



trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14625>

    this could be a method in the parent class



trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14626>

    this could be a method in the parent class



trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14635>

    yes, an array is an object, so reference + sizeof(array)



trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14638>

    CheckIfNullString.masks
    maybe they should be in a Masks class then, along with the masking code 
generation (" & (byte)+...) 



trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14643>

    This feels weird.
    what about:
    "SchemaTuple.unbox(("+getBoxedTypeName()+")appendGet(diff))"
    
    getBoxedTypeName() {
     switch (type) {
       case (DataType.INTEGER): return "Integer";
    ...



trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14621>

    use 
http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/StringBuilder.html for 
all those string concatenations.



trunk/src/org/apache/pig/data/SchemaTuple.java
<https://reviews.apache.org/r/4651/#comment14619>

    what does isObjNotTup mean? What is this case?


- Julien


On 2012-04-05 01:31:00, Jonathan Coveney wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/4651/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2012-04-05 01:31:00)
bq.  
bq.  
bq.  Review request for Julien Le Dem.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  This work builds on Dmitriy's PrimitiveTuple work. The idea is that, 
knowing the Schema on the frontend, we can code generate Tuples which can be 
used for fun and profit. In rudimentary tests, the memory efficiency is 2-4x 
better, and it's ~15% smaller serialized (heavily heavily depends on the data, 
though). Need to do get/set tests, but assuming that it's on par (or even 
faster) than Tuple, the memory gain is huge.
bq.  
bq.  Need to clean up the code and add tests.
bq.  
bq.  Right now, it generates a SchemaTuple for every inputSchema and 
outputSchema given to UDF's. The next step is to make a SchemaBag, where I 
think the serialization savings will be really huge.
bq.  
bq.  Needs tests and comments, but I want the code to settle a bit.
bq.  
bq.  
bq.  This addresses bug PIG-2632.
bq.      https://issues.apache.org/jira/browse/PIG-2632
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    trunk/src/org/apache/pig/data/TypeAwareTuple.java 1309628 
bq.    
trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java
 1309628 
bq.    trunk/src/org/apache/pig/data/BinInterSedes.java 1309628 
bq.    trunk/src/org/apache/pig/data/PrimitiveTuple.java 1309628 
bq.    trunk/src/org/apache/pig/data/SchemaTuple.java PRE-CREATION 
bq.    trunk/src/org/apache/pig/data/TupleFactory.java 1309628 
bq.    trunk/src/org/apache/pig/impl/PigContext.java 1309628 
bq.    
trunk/src/org/apache/pig/newplan/logical/expression/UserFuncExpression.java 
1309628 
bq.  
bq.  Diff: https://reviews.apache.org/r/4651/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Jonathan
bq.  
bq.


                
> Create a SchemaTuple which generates efficient Tuples via code gen
> ------------------------------------------------------------------
>
>                 Key: PIG-2632
>                 URL: https://issues.apache.org/jira/browse/PIG-2632
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Jonathan Coveney
>            Assignee: Jonathan Coveney
>             Fix For: 0.11
>
>         Attachments: PIG-2632-0.patch, PIG-2632-1.patch
>
>
> This work builds on Dmitriy's PrimitiveTuple work. The idea is that, knowing 
> the Schema on the frontend, we can code generate Tuples which can be used for 
> fun and profit. In rudimentary tests, the memory efficiency is 2-4x better, 
> and it's ~15% smaller serialized (heavily heavily depends on the data, 
> though). Need to do get/set tests, but assuming that it's on par (or even 
> faster) than Tuple, the memory gain is huge.
> Need to clean up the code and add tests.
> Right now, it generates a SchemaTuple for every inputSchema and outputSchema 
> given to UDF's. The next step is to make a SchemaBag, where I think the 
> serialization savings will be really huge.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (PIG-2632) Create a SchemaTuple which generates efficient Tuples via code gen

Reply via email to