[jira] [Commented] (PIG-2632) Create a SchemaTuple which generates efficient Tuples via code gen

[email protected] (Commented) (JIRA) Mon, 09 Apr 2012 13:17:50 -0700

    [ 
https://issues.apache.org/jira/browse/PIG-2632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13250139#comment-13250139
 ]

[email protected] commented on PIG-2632:
----------------------------------------------------

bq.  On 2012-04-09 05:34:39, Julien Le Dem wrote:
bq.  >
bq.  
bq.  Jonathan Coveney wrote:
bq.      Thanks again for taking a look Julien! I think we are close to having 
the structure where we want it, there are just a couple of outstanding higher 
level points:
bq.      - How should PigContext be referenced?
bq.      - How should the intermediate compiled data be handle?
bq.      - What should the semantics of getting a SchemaTupleFactory be?
bq.      - Generating the SchemaTupleFactory so it can directly instantiate the 
given SchemaTuple
bq.      - What should and shouldn't be in the generated code
bq.      
bq.      And probably some more. I responded below, and hopefully we can keep 
honing in on something that looks good :)

- How should PigContext be referenced?
We can leave it as is for now. It sounds like it's too big a refactor to be 
done in this patch

- How should the intermediate compiled data be handle?
For generated java code, I would say that the classes are generated in a temp 
directory and we use a URLClassLoader on the Frontend to access them. On the 
backend the class are just added to the job jar.
Once we move to generated bytecode then don't need to write the bytecode to 
disk and can just define() classes in a custom classloader.
In both cases if we don't hold a reference to the classloader or the classes 
they will be collected once we are done using them.

- What should the semantics of getting a SchemaTupleFactory be?
There are multiple places this could be used:
* LoadFunc and EvalFunc: possibly through a getOutputTupleFactory() added to 
the base classes.
* Spill of bags
* Pig intermediary storage.

- Generating the SchemaTupleFactory so it can directly instantiate the given 
SchemaTuple
The only thing you need is a subclass with an instantiate() implementation that 
directly calls the constructor.

- What should and shouldn't be in the generated code
The generated code should only have the parts that depend on "field_"+i and 
related methods and switch statements. Everything else should be in the parent. 
the default: cases of the switch statements can call super.set...() to handle 
the default case with the "append" tuple.
This will make it easier to switch to bytecode generation.

bq.  On 2012-04-09 05:34:39, Julien Le Dem wrote:
bq.  > 
trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java,
 lines 135-136
bq.  > 
<https://reviews.apache.org/r/4651/diff/1-3/?file=100075#file100075line135>
bq.  >
bq.  >     instead of checking here if the tuple is generatable, the factory 
could default to the regular TupleFactory if the generation fails.
bq.  
bq.  Jonathan Coveney wrote:
bq.      I don't know that I like that semantic though. In this case, in the 
client code it's clear that they know that they either are or are not getting a 
SchemaTupleFactory... if we default to a regular TupleFactory, then we might 
think we are getting a SchemaTuple (and all the speed and memory benefits) when 
we actually aren't. There are two cases where we would ask for a 
SchemaTupleFactory and it wouldn't be available: a Pig bug in our 
implementation, or user error. Shouldn't pig fail out intelligently instead of 
giving them a Factory that doesn't do what they think?

It depends how the generated factory gets used. My idea was to make sure we 
don't break existing scripts and to have a fallback solution in case of 
problems. But if you have to explicitly use the new factory then maybe it is 
fine.

bq.  On 2012-04-09 05:34:39, Julien Le Dem wrote:
bq.  > trunk/src/org/apache/pig/data/SchemaTuple.java, line 84
bq.  > 
<https://reviews.apache.org/r/4651/diff/1-3/?file=100078#file100078line84>
bq.  >
bq.  >     append...() methods should not be null
bq.  >     (part of your TODO list?)
bq.  
bq.  Jonathan Coveney wrote:
bq.      Not sure what you mean here

Sorry, I meant: should not be *public*

bq.  On 2012-04-09 05:34:39, Julien Le Dem wrote:
bq.  > trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java, line 64
bq.  > <https://reviews.apache.org/r/4651/diff/3/?file=100952#file100952line64>
bq.  >
bq.  >     try avoiding accessing the PigContext statically. Add it as a 
parameter and see if it can be passed from somewhere it is actually known.
bq.  
bq.  Jonathan Coveney wrote:
bq.      In the pig code that I looked at, this seemed by far the more common 
idiom (not that I love it, but it's pretty hard to get the PigContext in 
there). I can't really find a single example of code in this sort of path that 
doesn't get the PigContext statically. I could rewire Pig to try and make this 
possible, but do you think that it is worth it/do you see any examples in the 
code base that I missed where PigContext isn't called statically in this sort 
of case? There are some examples of more purely front end things, and that's 
how I'd pipe it in I suppose, but it seems like a big change for something that 
barely touches the codebase. I do think a JIRA that tackles this issue in pig 
is reasonable though... trying to decrease the amount of statically referenced 
jank. Await your thoughts.

If it requires too much refactoring let's not do it in this patch

bq.  On 2012-04-09 05:34:39, Julien Le Dem wrote:
bq.  > trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java, line 71
bq.  > <https://reviews.apache.org/r/4651/diff/3/?file=100952#file100952line71>
bq.  >
bq.  >     possibly we want to rename this and/or add something else for this 
file.
bq.  
bq.  Jonathan Coveney wrote:
bq.      Question about class files: does a class file have to have the same 
name as the class it contains? I know this is true for .java files with the 
java compiler, but not sure if it is true of class files. If it is true of 
class files, then we can't muck with the name. If it isn't, then ok.

yes the .class file should have the same name as the class it contains and the 
package name should match the path to the file.

if addScriptFile() is not specific to scripts it should be renamed.

bq.  On 2012-04-09 05:34:39, Julien Le Dem wrote:
bq.  > trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java, line 127
bq.  > <https://reviews.apache.org/r/4651/diff/3/?file=100952#file100952line127>
bq.  >
bq.  >     "org.apache.pig.generated." + classname ?
bq.  
bq.  Jonathan Coveney wrote:
bq.      Java question: if we're managing the compilation, does the compiled 
class have to be in org/apache/pig/generated (or whatever) in the file 
directory?

yes, the relative path to the class from the root given to the classloader 
(path or jar) should match the package of the class.

bq.  On 2012-04-09 05:34:39, Julien Le Dem wrote:
bq.  > trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java, line 138
bq.  > <https://reviews.apache.org/r/4651/diff/3/?file=100952#file100952line138>
bq.  >
bq.  >     "org/apache/pig/generated/" + classname + ".class" ?
bq.  
bq.  Jonathan Coveney wrote:
bq.      What's the win here? Or just general cleanliness?

So that you have your own namespace to avoid conflicts and so that package 
protected members are visible.
Actually, this should be the same package as your base class.

bq.  On 2012-04-09 05:34:39, Julien Le Dem wrote:
bq.  > trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java, line 140
bq.  > <https://reviews.apache.org/r/4651/diff/3/?file=100952#file100952line140>
bq.  >
bq.  >     we should probably not compile in the current dir.
bq.  >     Put it in the temp dir intead
bq.  
bq.  Jonathan Coveney wrote:
bq.      I will see if this is possible. Do you know if Pig has an official 
local temp directory? I found an hdfs temp directory flag, but couldn't find a 
local one.

You can ask java to give you temporary files. You can reuse the path for a 
temporary directory.

bq.  On 2012-04-09 05:34:39, Julien Le Dem wrote:
bq.  > trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java, lines 
265-269
bq.  > <https://reviews.apache.org/r/4651/diff/3/?file=100952#file100952line265>
bq.  >
bq.  >     stuff which is not calling a generated field directly should be 
pulled up.
bq.  
bq.  Jonathan Coveney wrote:
bq.      I don't know that I agree with this. Then we're going to have a bunch 
of really random methods that have no context getting called for no real reason 
(I guess so that the compiler will run it's checks instead of getting an error 
at compile time)? Some of the logic makes more sense to do this with (the 
comparison of appends above, for example), but do you think the win for highly 
specific pieces of code like this is worth it?

The idea is to make it a simple as possible for when we switch to bytecode 
generation. Agreed they should have meaningful names so that could be on a case 
by case basis. Use your best judgement :)

bq.  On 2012-04-09 05:34:39, Julien Le Dem wrote:
bq.  > trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java, lines 
392-396
bq.  > <https://reviews.apache.org/r/4651/diff/3/?file=100952#file100952line392>
bq.  >
bq.  >     the exception handling can be pulled up
bq.  
bq.  Jonathan Coveney wrote:
bq.      Same as above. Should it really be our goal to move everything humanly 
possible into SchemaTuple?

It will make bytecode generation easier. If it gets a little too extreme we can 
make exceptions.

bq.  On 2012-04-09 05:34:39, Julien Le Dem wrote:
bq.  > trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java, line 518
bq.  > <https://reviews.apache.org/r/4651/diff/3/?file=100952#file100952line518>
bq.  >
bq.  >     same with a mapping for the java object for the type
bq.  
bq.  Jonathan Coveney wrote:
bq.      Not sure what you want the code to look like here. If you want to 
explicitly do all the casts, it will make the generated code bigger, and would 
seem to contradict your suggestion to call as much from SchemaTuple as possible?

The unbox trick will not work in bytecode generation as you will have to 
explicitly provide the type.

- Julien

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4651/#review6784
-----------------------------------------------------------

On 2012-04-08 22:26:29, Jonathan Coveney wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/4651/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2012-04-08 22:26:29)
bq.  
bq.  
bq.  Review request for pig and Julien Le Dem.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  This work builds on Dmitriy's PrimitiveTuple work. The idea is that, 
knowing the Schema on the frontend, we can code generate Tuples which can be 
used for fun and profit. In rudimentary tests, the memory efficiency is 2-4x 
better, and it's ~15% smaller serialized (heavily heavily depends on the data, 
though). Need to do get/set tests, but assuming that it's on par (or even 
faster) than Tuple, the memory gain is huge.
bq.  
bq.  Need to clean up the code and add tests.
bq.  
bq.  Right now, it generates a SchemaTuple for every inputSchema and 
outputSchema given to UDF's. The next step is to make a SchemaBag, where I 
think the serialization savings will be really huge.
bq.  
bq.  Needs tests and comments, but I want the code to settle a bit.
bq.  
bq.  
bq.  This addresses bug PIG-2632.
bq.      https://issues.apache.org/jira/browse/PIG-2632
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    trunk/bin/pig 1310666 
bq.    trunk/build.xml 1310666 
bq.    trunk/ivy.xml 1310666 
bq.    trunk/ivy/libraries.properties 1310666 
bq.    
trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java
 1310666 
bq.    trunk/src/org/apache/pig/data/BinInterSedes.java 1310666 
bq.    trunk/src/org/apache/pig/data/FieldIsNullException.java PRE-CREATION 
bq.    trunk/src/org/apache/pig/data/PrimitiveTuple.java 1310666 
bq.    trunk/src/org/apache/pig/data/SchemaTuple.java PRE-CREATION 
bq.    trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java 
PRE-CREATION 
bq.    trunk/src/org/apache/pig/data/SchemaTupleFactory.java PRE-CREATION 
bq.    trunk/src/org/apache/pig/data/Tuple.java 1310666 
bq.    trunk/src/org/apache/pig/data/TupleFactory.java 1310666 
bq.    trunk/src/org/apache/pig/data/TypeAwareTuple.java 1310666 
bq.    trunk/src/org/apache/pig/data/utils/SedesHelper.java PRE-CREATION 
bq.    trunk/src/org/apache/pig/impl/PigContext.java 1310666 
bq.    
trunk/src/org/apache/pig/newplan/logical/expression/UserFuncExpression.java 
1310666 
bq.  
bq.  Diff: https://reviews.apache.org/r/4651/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Jonathan
bq.  
bq.

> Create a SchemaTuple which generates efficient Tuples via code gen
> ------------------------------------------------------------------
>
>                 Key: PIG-2632
>                 URL: https://issues.apache.org/jira/browse/PIG-2632
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Jonathan Coveney
>            Assignee: Jonathan Coveney
>             Fix For: 0.11
>
>         Attachments: PIG-2632-0.patch, PIG-2632-1.patch, PIG-2632-3.patch
>
>
> This work builds on Dmitriy's PrimitiveTuple work. The idea is that, knowing 
> the Schema on the frontend, we can code generate Tuples which can be used for 
> fun and profit. In rudimentary tests, the memory efficiency is 2-4x better, 
> and it's ~15% smaller serialized (heavily heavily depends on the data, 
> though). Need to do get/set tests, but assuming that it's on par (or even 
> faster) than Tuple, the memory gain is huge.
> Need to clean up the code and add tests.
> Right now, it generates a SchemaTuple for every inputSchema and outputSchema 
> given to UDF's. The next step is to make a SchemaBag, where I think the 
> serialization savings will be really huge.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (PIG-2632) Create a SchemaTuple which generates efficient Tuples via code gen

Reply via email to