[ 
https://issues.apache.org/jira/browse/PIG-2632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13249490#comment-13249490
 ] 

[email protected] commented on PIG-2632:
----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4651/
-----------------------------------------------------------

(Updated 2012-04-08 07:19:58.096859)


Review request for pig and Julien Le Dem.


Changes
-------

Julien, this incorporates many of your comments, but not all. Mainly, it has 
the refactoring of the code. A few outstanding issues:
- The classloading is still janky. I'm not quite sure what the best approach is.
- I need to figure out how to register the classes I generate in the jar 
manifest.
- Because of the way the code is generated, protected fields don't quite work. 
The code doesn't have a package, so only public methods are available. I marked 
the classes it depended on as private, but I don't know if that is enough. 
If it's a big issue, I guess the next thing to do is to figure out how to 
generate code in a specific package of my choice, and ideally, how to generate 
the class in memory and add it to the jar.
- And of course some finer points: I need to implement a raw comparator, etc.
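
For the "specific package, in memory" point above, one possible approach is the 
JDK's standard javax.tools compiler API. The sketch below is only an 
illustration under assumptions: it is not what the patch does, the class name 
InMemoryCompileSketch and the package org.example.gen are made up, and it needs 
a full JDK at runtime (on a bare JRE, ToolProvider.getSystemJavaCompiler() 
returns null).

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.net.URI;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import javax.tools.FileObject;
import javax.tools.ForwardingJavaFileManager;
import javax.tools.JavaCompiler;
import javax.tools.JavaFileManager;
import javax.tools.JavaFileObject;
import javax.tools.SimpleJavaFileObject;
import javax.tools.StandardJavaFileManager;
import javax.tools.ToolProvider;

/** Hypothetical sketch: compile a source string into a chosen package, in memory. */
class InMemoryCompileSketch {

    /** Source code held in a String instead of a file on disk. */
    static final class Source extends SimpleJavaFileObject {
        private final String code;
        Source(String className, String code) {
            super(URI.create("string:///" + className.replace('.', '/')
                    + Kind.SOURCE.extension), Kind.SOURCE);
            this.code = code;
        }
        @Override public CharSequence getCharContent(boolean ignoreEncodingErrors) {
            return code;
        }
    }

    /** Receives the compiled bytecode in a byte array instead of a .class file. */
    static final class ClassBytes extends SimpleJavaFileObject {
        final ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        ClassBytes(String className) {
            super(URI.create("mem:///" + className.replace('.', '/')
                    + Kind.CLASS.extension), Kind.CLASS);
        }
        @Override public OutputStream openOutputStream() {
            return bytes;
        }
    }

    /** Compiles one class whose source declares the package named in className. */
    static Class<?> compile(String className, String code) throws Exception {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        if (compiler == null) {
            throw new IllegalStateException("No system compiler; a JDK is required");
        }
        final Map<String, ClassBytes> out = new HashMap<>();
        StandardJavaFileManager std = compiler.getStandardFileManager(null, null, null);
        JavaFileManager fm = new ForwardingJavaFileManager<StandardJavaFileManager>(std) {
            @Override public JavaFileObject getJavaFileForOutput(
                    JavaFileManager.Location location, String name,
                    JavaFileObject.Kind kind, FileObject sibling) {
                // Redirect every generated class into memory, keyed by binary name.
                ClassBytes cb = new ClassBytes(name);
                out.put(name, cb);
                return cb;
            }
        };
        boolean ok = compiler.getTask(null, fm, null, null, null,
                Collections.singletonList(new Source(className, code))).call();
        fm.close();
        if (!ok) {
            throw new IllegalStateException("Compilation failed for " + className);
        }
        // A child loader that defines classes straight from the captured bytes.
        ClassLoader loader = new ClassLoader(InMemoryCompileSketch.class.getClassLoader()) {
            @Override protected Class<?> findClass(String name) throws ClassNotFoundException {
                ClassBytes cb = out.get(name);
                if (cb == null) {
                    throw new ClassNotFoundException(name);
                }
                byte[] b = cb.bytes.toByteArray();
                return defineClass(name, b, 0, b.length);
            }
        };
        return loader.loadClass(className);
    }

    public static void main(String[] args) throws Exception {
        String src = "package org.example.gen;\n"
                + "public class Hello { public int answer() { return 42; } }\n";
        Class<?> c = compile("org.example.gen.Hello", src);
        Object hello = c.getDeclaredConstructor().newInstance();
        System.out.println(c.getName() + " -> " + c.getMethod("answer").invoke(hello));
    }
}
```

One caveat with this route: the JVM grants package-private access only within 
the same runtime package, meaning the same package name and the same defining 
class loader. A class defined by a separate loader, as above, still would not 
see package-private members of classes loaded by the application loader, so 
picking the package name alone may not be enough.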

But I'd like to know if the general new structure works. Of course it's 
very much a work in progress, but the comments really help.

Lastly, I'd like to know how this should interact with PrimitiveTuples. I still 
think there is a place for them (since SchemaTuples have to be generated on the 
front end but PrimitiveTuples do not), but the whole 
TupleFactory.newTupleForSchema thing is weird... I went with a 
TupleFactory.getInstanceForSchema(Schema) approach and liked it a lot more. 
Another question is what to do when a SchemaTuple can't be generated for the 
Schema... one option is to just return a regular Tuple, and another is to fail 
out. IMHO we should fail out, 
and require people to ensure it's generatable, but I can see the argument 
otherwise. In general, for things like this, I think it's better to fail early 
and explicitly than to let people think they have a special Tuple when they 
don't. Philosophies may differ.
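
To make the two options concrete, here is a minimal sketch of a fail-early 
lookup next to a lenient one. Everything in it is hypothetical: the class 
SchemaFactorySketch, the nested Factory, the list of supported type names, and 
the use of a plain list of strings in place of a real Schema are stand-ins for 
illustration, not the API in the patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; none of these names are Pig's actual API. */
class SchemaFactorySketch {

    /** Stand-in for a factory bound to one generated tuple class. */
    static final class Factory {
        final List<String> fieldTypes;
        Factory(List<String> fieldTypes) {
            this.fieldTypes = fieldTypes;
        }
        /** Stand-in for creating a new, empty generated tuple. */
        Object[] newTuple() {
            return new Object[fieldTypes.size()];
        }
    }

    // Assumption: the field types we pretend the code generator can handle.
    private static final List<String> SUPPORTED =
            Arrays.asList("int", "long", "float", "double", "chararray");

    // One factory per distinct schema (not thread-safe; fine for a sketch).
    private static final Map<List<String>, Factory> CACHE = new HashMap<>();

    /** Fail-early variant: throws if any field type cannot be generated. */
    static Factory getInstanceForSchema(List<String> fieldTypes) {
        for (String type : fieldTypes) {
            if (!SUPPORTED.contains(type)) {
                throw new IllegalArgumentException(
                        "Cannot generate a tuple for field type: " + type);
            }
        }
        // Defensive copy so later changes to the caller's list cannot corrupt the key.
        List<String> key = Collections.unmodifiableList(new ArrayList<>(fieldTypes));
        return CACHE.computeIfAbsent(key, Factory::new);
    }

    /** Lenient variant: returns null so the caller can fall back to a generic tuple. */
    static Factory getInstanceForSchemaOrNull(List<String> fieldTypes) {
        try {
            return getInstanceForSchema(fieldTypes);
        } catch (IllegalArgumentException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        Factory f = getInstanceForSchema(Arrays.asList("int", "chararray"));
        System.out.println("fields: " + f.newTuple().length);
        System.out.println("lenient: " + getInstanceForSchemaOrNull(Arrays.asList("map")));
    }
}
```

With the fail-early variant, a caller that asks for an ungeneratable schema 
finds out immediately; with the lenient one, every caller has to remember to 
check for the fallback case.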


Summary
-------

This work builds on Dmitriy's PrimitiveTuple work. The idea is that, knowing 
the Schema on the frontend, we can code generate Tuples which can be used for 
fun and profit. In rudimentary tests, the memory efficiency is 2-4x better, and 
it's ~15% smaller serialized (though this depends heavily on the data). 
Need to do get/set tests, but assuming that it's on par with (or even faster 
than) Tuple, the memory gain is huge.
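
As a rough illustration of where the memory savings could come from, compare a 
generic tuple that stores boxed objects in a list with a class of the kind a 
generator might emit for a two-field (int, long) schema. Both classes below are 
hand-written, hypothetical stand-ins (including the names and the null-bit 
layout); they are not the output of SchemaTupleClassGenerator.

```java
import java.util.ArrayList;
import java.util.List;

/** Generic layout: each field is a separate boxed object referenced from a list. */
class GenericTupleSketch {
    private final List<Object> fields = new ArrayList<>();

    void append(Object value) {
        fields.add(value);
    }

    Object get(int i) {
        return fields.get(i);
    }

    int size() {
        return fields.size();
    }
}

/** Hypothetical shape of a generated tuple for the schema (int, long). */
class SchemaTupleIntLongSketch {
    private int f0;           // stored as a primitive, no Integer object
    private long f1;          // stored as a primitive, no Long object
    private int nullBits = 3; // bit i set means field i is null; both start null

    void setInt0(int value) {
        f0 = value;
        nullBits &= ~1;
    }

    void setLong1(long value) {
        f1 = value;
        nullBits &= ~2;
    }

    boolean isNull(int i) {
        return (nullBits & (1 << i)) != 0;
    }

    /** Generic accessor: boxes only when a caller asks for an Object. */
    Object get(int i) {
        if (isNull(i)) {
            return null;
        }
        switch (i) {
            case 0: return f0;
            case 1: return f1;
            default: throw new IndexOutOfBoundsException("No field " + i);
        }
    }

    int size() {
        return 2;
    }

    public static void main(String[] args) {
        SchemaTupleIntLongSketch t = new SchemaTupleIntLongSketch();
        t.setInt0(7);
        t.setLong1(9L);
        System.out.println(t.get(0) + ", " + t.get(1));
    }
}
```

The generated-style class holds its values inline as primitives plus a few null 
bits, while the generic one pays for a list, an object reference per field, and 
a boxed object per value.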

Need to clean up the code and add tests.

Right now, it generates a SchemaTuple for every inputSchema and outputSchema 
given to UDFs. The next step is to make a SchemaBag, where I think the 
serialization savings will be really huge.

Needs tests and comments, but I want the code to settle a bit.


This addresses bug PIG-2632.
    https://issues.apache.org/jira/browse/PIG-2632


Diffs (updated)
-----

  trunk/bin/pig 1310666 
  trunk/build.xml 1310666 
  trunk/ivy.xml 1310666 
  trunk/ivy/libraries.properties 1310666 
  trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java 1310666 
  trunk/src/org/apache/pig/data/BinInterSedes.java 1310666 
  trunk/src/org/apache/pig/data/FieldIsNullException.java PRE-CREATION 
  trunk/src/org/apache/pig/data/PrimitiveTuple.java 1310666 
  trunk/src/org/apache/pig/data/SchemaTuple.java PRE-CREATION 
  trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java PRE-CREATION 
  trunk/src/org/apache/pig/data/SchemaTupleFactory.java PRE-CREATION 
  trunk/src/org/apache/pig/data/Tuple.java 1310666 
  trunk/src/org/apache/pig/data/TupleFactory.java 1310666 
  trunk/src/org/apache/pig/data/TypeAwareTuple.java 1310666 
  trunk/src/org/apache/pig/data/utils/SedesHelper.java PRE-CREATION 
  trunk/src/org/apache/pig/impl/PigContext.java 1310666 
  trunk/src/org/apache/pig/newplan/logical/expression/UserFuncExpression.java 1310666 

Diff: https://reviews.apache.org/r/4651/diff


Testing
-------


Thanks,

Jonathan


                
> Create a SchemaTuple which generates efficient Tuples via code gen
> ------------------------------------------------------------------
>
>                 Key: PIG-2632
>                 URL: https://issues.apache.org/jira/browse/PIG-2632
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Jonathan Coveney
>            Assignee: Jonathan Coveney
>             Fix For: 0.11
>
>         Attachments: PIG-2632-0.patch, PIG-2632-1.patch
>
>
> This work builds on Dmitriy's PrimitiveTuple work. The idea is that, knowing 
> the Schema on the frontend, we can code generate Tuples which can be used for 
> fun and profit. In rudimentary tests, the memory efficiency is 2-4x better, 
> and it's ~15% smaller serialized (though this depends heavily on the data). 
> Need to do get/set tests, but assuming that it's on par with (or even faster 
> than) Tuple, the memory gain is huge.
> Need to clean up the code and add tests.
> Right now, it generates a SchemaTuple for every inputSchema and outputSchema 
> given to UDFs. The next step is to make a SchemaBag, where I think the 
> serialization savings will be really huge.


        
